Despite the lack of formal guidelines, synthetic speech displays are used in a growing
variety of applications. Telephone information systems permitting human-computer
interaction from remote locations are an especially popular implementation of
computer-generated speech. Currently, human factors research is needed to specify
design characteristics providing usable telephone information systems as defined by
task performance and user ratings. Previous research used nonintegrated tasks such as
transcription of phonetic syllables, words, or sentences to assess task performance or
user preference differences. This study used a computer-driven telephone information
system as a real-time, human-computer interface to simulate applications where
synthetic speech is used to access data. Subjects used a telephone keypad to navigate
through an automated, department-store database to locate and transcribe specific
information messages. Because speech provides a sequential and transient information
display, users may have difficulty navigating through auditory databases. One issue
investigated in this study was whether use of alternating male and female voices to
code different levels in the database hierarchy would improve user search performance.
Other issues investigated were basic intelligibility of these male and female voices
as influenced by different levels of speech rate. All factors were assessed as
functions of search or transcription task performance and user preference. Analysis of
transcription accuracy, search efficiency and time, and subjective ratings revealed an
overall significant effect of speech rate on all three groups of measures but no
significant effects for voice type or coding scheme. Results were used to recommend
design guidelines for developing speech displays for telephone information systems.
Introduction


Overview 


Modern speech research involving electronic analysis of speech began with the
introduction of the sound spectrograph developed by the Bell Telephone Laboratories in
1946 and Franklin Cooper's “pattern playback” machine constructed in 1950 at the
Haskins Laboratories (Pisoni, 1982). Synthetic speech research remained the province
of large research centers until the late 1970s. According to Bristow (1984), the
innovation of Very Large Scale Integration (VLSI) devices in 1977 initiated a
“[synthetic] speech revolution”. Reliable performance and attractive cost of VLSI
devices resulted in a marked increase of synthetic speech research and rapid
introduction of synthetic speech displays to the public domain. Figure 1 on page 2
depicts a summary of the history of synthetic speech concept and hardware development
(see Appendix A for references used in Figure 1).

Figure 1. Research Summary of Synthetic Speech Concepts and Hardware (From Klatt, 1986)

Commercial developers of speech synthesizers did not wait for further research.
Instead, synthetic speech displays were implemented in absence of empirically derived
guidelines (Pisoni, 1982). This parallel progression of research and operational
implementation continues today. And it is text-to-speech synthesizers that promise the
greatest utility for applications in which unrestricted English text must be converted
into speech such as information retrieval by phone (Slowiaczek and Nusbaum, 1985;
Allen, 1981). Current speech technology gives us the opportunity to make the telephone
a terminal, thereby taking greater advantage of a device one author termed the “most
powerful communications tool in human history” (McHugh, 1986). Telephones have a large
user population allowing access to telephone-based information systems from
practically anywhere. Additionally, those who might be otherwise intimidated by
computers may more freely accept using a familiar and simple device such as the
telephone as a terminal for computer-driven information systems (Labrador and Pai,
1984). Yet, we have few if any guidelines for using synthetic speech displays in
telephone information systems.


Purpose 


This study addressed the lack of guidelines for telephone-based information systems by
investigating effects of voice type and speech rate on task performance of a synthetic
speech display. Measures of intelligibility and search efficiency were used to detect
performance differences, and subjective ratings were used to assess user preferences
and impressions. A major question of this study was whether alternating male and
female synthetic voices as an informational coding scheme improved performance in an
automated database as compared to using a single voice to present all information.
Related to this issue was whether one voice was more intelligible than the other for
pronouncing key-words and sentences. Finally, this study examined the effect on task
performance and user preferences of increasing speech rate beyond optimum rates
demonstrated in previous research. Previous study results suggest a performance
optimum of 180 wpm for DECtalk's Perfect Paul voice (Merva and Williges, 1987). This
study continued the inquiry of optimum rate by comparing Perfect Paul to DECtalk's
Beautiful Betty voice, also found highly intelligible in earlier research (Greene,
Manous, and Pisoni, 1984).
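
The voice coding scheme described above can be illustrated with a minimal sketch. It
assumes a menu hierarchy stored as nested dictionaries and assigns a voice by depth;
the function names, the menu entries, and the choice of Perfect Paul for the
single-voice condition are hypothetical and are not taken from the study's software.

```python
# Hypothetical sketch of voice coding by hierarchy level.
# The menu contents and all names below are illustrative assumptions.
VOICES = ("Perfect Paul", "Beautiful Betty")


def voice_for_level(level, alternate=True):
    """Return the voice used at a given menu depth (0 = top level)."""
    if not alternate:
        # Single-voice condition: every level uses the same voice.
        return VOICES[0]
    # Alternating condition: even levels use one voice, odd levels the other.
    return VOICES[level % 2]


def walk(menu, level=0, alternate=True):
    """Yield (level, voice, label) for every entry of a nested-dict menu."""
    for label, submenu in menu.items():
        yield level, voice_for_level(level, alternate), label
        yield from walk(submenu, level + 1, alternate)


# Made-up department-store hierarchy, for illustration only.
store = {
    "Housewares": {"Cookware": {}, "Linens": {}},
    "Clothing": {"Shoes": {}},
}

for level, voice, label in walk(store):
    print("  " * level + f"{label} [{voice}]")
```

Under the alternating scheme a change of voice signals a change of level, which is the
cue the study tested against a single voice for all levels.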
Literature Review


Methods  of  Speech  Synthesis 


For  this  study,  the  term  synthetic  speech  refers  to  speech  generated  entirely 
by  rule  or  algorithms  without  the  aid  of  an  original,  human  recording  (Simpson, 
McCauley,  Roland,  Ruth,  and  Williges,  1985).  Computers  also  use  other  methods  of 
speech  generation  such  as  digitized  speech  and  analysis-synthesis.  These  alternate 
methods  of  producing  synthetic  speech  may  feature  better  voice  quality  than  speech 
synthesized  by  rule  but  suffer  disadvantages  not  shared  by  rule-generated  speech. 


Digitized  Speech 


Speech synthesis by rule differs from digitized speech, which is human speech recorded
digitally and then (usually) transformed into a more compressed data format. Digital
recording processes may sample human speech up to 8000 or more times per second.
Fidelity to the original signal and hence, intelligibility, is excellent at such rates
but massive amounts of storage capability are required to store the digitized
information (Sanders and McCormick, 1987). Storage limitations lead to fixed-sized
vocabularies which must also be updated to add new words. Furthermore, since digitized
speech depends on an original source, voice variety is fixed for a recording. To use
additional voices in a digitized speech display compounds storage problems mentioned
earlier. The unlimited variety of human voices available for a digitized speech
display also imparts unique problems of variability in its research (Simpson, et al.,
1985). Research replication using digitized speech would require either the same voice
or one similar as selected by standard voice parameters. Additionally, guideline
standardization becomes very difficult with a virtually unlimited variety of human
voices for digital recording sources.


Synthesis  by  Analysis 


Analysis-synthesis methods electronically model the human voice mechanism to produce
speech sounds (Sanders and McCormick, 1987). The source speech wave is analyzed along
certain parameters which are encoded by the speech analyzer and stored. This method,
also known as waveform sampling, differs from digitized speech which encodes the
actual speech wave and requires far more computer memory to store speech information
than does speech synthesized by rule. For example, analysis-synthesis using a common
analog-to-digital conversion requires about 64,000 bits per second for uncompressed
speech (8000 samples per second to capture up to 4000 Hertz (Hz), multiplied by 8 bits
per sample) (Kaplan and Lerner, 1985). The same, approximate memory requirements used
by digitized speech result in very natural (human-like) speech. However, speech
produced by analysis-synthesis tends to sound awkward and unnatural because of a lack
of coarticulation, or the natural blending and modification of speech sounds caused by
words and phonemes that precede and follow a particular sound. A phoneme can be
thought of as the smallest speech sound that can change the meaning of a word, but the
phoneme is really more a theoretical definition than a precise definition of the
spoken segments of our speech alphabet (Kantowitz and Sorkin, 1983). Some (Simpson, et
al., 1985; Flanagan, 1972) refer to analysis-synthesis methods as digitized speech
since it uses a digital data-compression technique.

Speech  Synthesis  by  Rule 

Speech generated by rule uses stored dictionaries of elementary speech segments and
sets of rules for combining them and for stressing particular sounds or words that
produce the prosody of speech (Sanders and McCormick, 1987). Prosody is the rhythm or
singsong quality of natural speech. Unlike digitized speech, rule-generated or
synthetic speech requires far less computer memory since it makes direct translation
of text into speech. As an example, formant (resonant frequency) synthesis, one of two
methods used to synthesize speech by rule, requires a data rate of approximately 100
bits per second based on a typical rate of 12 phonemes per second with each phoneme
characterized by an 8-bit code (Kaplan and Lerner, 1985). This memory requirement is
far less than the 64,000 bits per second required by analysis-synthesis or digitized
speech methods. Formant synthesis simulates the formants or resonances of the vocal
tract and is used by Digital Equipment Corporation's DECtalk, the speech synthesizer
used in this study. Linear predictive coding (LPC), the other rule-generated,
synthetic speech method, uses a mathematical representation of the vocal tract as
acoustic tubes to produce speech.
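
The two data rates quoted from Kaplan and Lerner (1985) follow from simple
multiplication, which the short sketch below reproduces. The function names are
illustrative, and the factor of two between the highest captured frequency and the
sampling rate is the standard Nyquist relation rather than something stated in the
text.

```python
def waveform_bit_rate(max_frequency_hz, bits_per_sample):
    """Bits per second for uncompressed sampled speech."""
    # Capturing frequencies up to f requires at least 2 * f samples per second.
    samples_per_second = 2 * max_frequency_hz
    return samples_per_second * bits_per_sample


def phoneme_bit_rate(phonemes_per_second, bits_per_phoneme):
    """Bits per second when speech is stored as a stream of phoneme codes."""
    return phonemes_per_second * bits_per_phoneme


sampled = waveform_bit_rate(4000, 8)   # 8000 samples/s * 8 bits
by_rule = phoneme_bit_rate(12, 8)      # 12 phonemes/s * 8 bits

print(sampled)                    # 64000
print(by_rule)                    # 96, which the text rounds to 100
print(round(sampled / by_rule))   # 667
```

The exact product for formant synthesis is 96 bits per second, so the stored phoneme
stream is several hundred times more compact than the sampled waveform.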

Another advantage of rule-generated, synthetic speech possessed by neither digitized
nor analysis-synthesis speech is direct, text translation which provides another name
for this method, text-to-speech. Rule-based speech synthesizers also feature several
file or default voice types making standardization of research and resulting
guidelines more practical. Consequently, synthetic speech systems do not depend on
human speakers for new vocabularies as do digitized speech or analysis-synthesis
speech which must use the same human speaker in order to sound consistent (Simpson, et
al., 1985). However, the best synthetic speech has yet to achieve a voice quality
comparable to the best of other methods. This limitation has made intelligibility the
prime variable of interest in most synthetic speech research with many related issues
still unresolved.

Perception  of  Synthetic  Speech 

With few exceptions, previous research has consistently demonstrated synthetic speech
to be less intelligible than natural, human speech except under optimum conditions of
low noise and high context (Pisoni and Hunnicut, 1980; Greene, et al., 1984). This
lower intelligibility produces two effects: either the information presented by
synthetic speech is not heard or remembered accurately, or the additional effort
required to understand it interferes with other tasks being carried out at the same
time (Cooper, 1987). Less clear are reasons behind the lower intelligibility. However,
researchers usually consider problems of synthetic speech intelligibility to lie in
human processes of speech perception and information processing. Luce, Feustel, and
Pisoni (1983) have suggested comprehension of synthetic speech places a greater
cognitive load on the listener because synthetic speech does not possess cues present
in natural, human speech. Additionally, Nusbaum, Dedina, and Pisoni (1984) postulate a
possible increase in short term memory requirements. Models of human information
processing are necessary to consider problems of synthetic speech intelligibility in
the context of short term memory.
Information Processing Theory


Broadbent  (1958)  formulated  the  limited-capacity  channel  model  which  has 
proved  to  be  a  milestone  in  human  information  processing  research  (Kantowitz  and 
Sorkin,  1983).  As  depicted  in  Figure  2  on  page  10,  this  formulation  was  characterized 
by  four  features: 

•  The  whole  nervous  system  is  regarded  as  a  single  channel,  having  a  limit  to  the 
rate  at  which  it  can  transmit  information. 

•  The  limited-capacity  portion  of  the  nervous  system  is  preceded  and  protected  by 
a  selective  filter. 

•  This  "filter"  is  preceded  by  a  buffer  or  temporary  (short-term)  store  which  could 
hold  any  excess  information  arriving  by  channels  other  than  the  one  selected. 

•  A  long-term  store  kept  information  passing  through  the  limited-capacity  system 
in  the  form  of  a  record  of  the  conditional  probability  that  events  of  one  kind  are 
followed  by  events  of  another  kind. 


This  "reasonable  first  approximation  of  human  capabilities  in  most  tasks"  has  since 
been  modified  by  Broadbent  (1971,  1982)  and  challenged  by  some  (Kantowitz,  1974; 
Kinsbourne,  1981;  Lane,  1981). 

Figure 2. Original limited-capacity channel model (From Broadbent, 1958)

Most challenges to Broadbent's model reveal the bottleneck in information processing
as represented by the limited-capacity channel is not as straightforward in practice
as originally thought. The basic tenet of the limited-capacity model is that humans
can transmit information only at a finite rate with output of one stage feeding
directly into the next stage, a serial processing function. More recent models
emphasize hybrid processing, which uses both serial and parallel processing in the
same activity. Parallel processing occurs when several stages simultaneously have
access to the same output of another stage (McCormick and Sanders, 1982). Unlike the
limited-capacity model, hybrid models allow information to enter in parallel with no
bottleneck (Kantowitz and Sorkin, 1983). Bottlenecks occur only when responses must be
emitted.

Although hybrid models are still undergoing revisions and challenges characteristic of
empirical methodology, most information theorists agree on the existence of short term
memory (STM), a function critical in synthetic speech perception. Research efforts of
Atkinson and Shiffrin (1968) suggest STM acts not only as a repository for new
information but incorporates a working memory responsible for decision making, problem
solving, and the general flow of information within the memory system. Rehearsal, the
overt or covert repetition of information, is one of the control processes used to
govern functions within the STM's working memory by maintaining information within
STM. Miller's classic paper, “The Magical Number Seven, Plus or Minus Two,” reported
research that demonstrated people can remember approximately seven items (Miller,
1956). More items could be recalled if combined into meaningful “chunks”, but the
number of chunks (not bits) remained approximately seven. Miller's view is still held
to be generally correct with further research demonstrating memory capacity to be
influenced also by such factors as acoustic similarity and word length (Conrad and
Hull, 1964; Baddeley, Thomson, and Buchanan, 1975).
Research findings stemming from these theories hold several implications for designers
of human-computer interfaces which use synthetic speech as a display. In a series of
three experiments, Luce, et al. (1983) compared subjects' recall for synthetic and
natural lists of monosyllabic words using the MITalk speech synthesizer. From their
results, they concluded difficulties in perception and comprehension of synthetic
speech are due in part to increased processing demands in short-term memory (STM). A
subsequent study by Nusbaum, et al. (1984) investigated two opposing hypotheses for
these increased processing demands imposed on STM. The first hypothesis held synthetic
speech to be simply equivalent to “noisy” natural speech. That is, basic cues of
synthetic speech were obscured, masked or physically degraded in a way similar to that
of natural speech in noise. A second, counter hypothesis postulated synthetic speech
to be perceptually impoverished relative to natural speech both in degree and kind.
Using three speech synthesizers and recordings of natural voice in four levels of
noise, Nusbaum, et al. had 83 undergraduates listen to one of these seven speech
sources speak 48 consonant-vowel (CV) syllables. Distribution patterns of errors and
confusions by subjects clearly supported the hypothesis that synthetic speech is not
perceived like natural speech but is, in some sense, impoverished.
Synthetic Speech Dependent Variables


Performance  Measures 


Synthetic speech research uses performance and preference measures to assess
independent variable effects on perception as reflected by dependent variable
constructs. Intelligibility, the fundamental dependent variable construct, is defined
operationally as the percentage of speech units correctly recognized by a human
listener out of a set of speech units such as words, sentences, phonemes or the
perceptual acoustical features of those phonemes (Simpson et al., 1985). Performance
measures of intelligibility for synthetic speech research were borrowed from
traditional communications research and include: Modified Rhyme Test (MRT), both open
and closed set (Fairbanks, 1958; House, Williams, Hecker, and Kryter, 1965); Harvard
Psycho-Acoustic Sentences (Egan, 1948); and Haskins Semantically Anomalous Sentences
(Nye and Gaitenby, 1974). Standardization, a strong advantage of these measures,
allows researchers to compare results across different conditions, such as performance
of different speech synthesizers, and allows different researchers to compare study
findings. However, there has been recent criticism of these measures and the MRT in
particular.
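
The operational definition of intelligibility given above (percentage of speech units
correctly recognized) can be sketched as follows. The function name, the
case-insensitive exact-match scoring rule, and the example word list are illustrative
assumptions; the standardized tests named above each define their own materials and
scoring procedures.

```python
def intelligibility(presented, responses):
    """Percentage of presented speech units that the listener reported correctly."""
    if len(presented) != len(responses):
        raise ValueError("each presented unit needs exactly one response")
    if not presented:
        raise ValueError("at least one speech unit is required")
    # Assumed scoring rule: a unit is correct only on an exact,
    # case-insensitive match after trimming whitespace.
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(presented, responses)
    )
    return 100.0 * correct / len(presented)


# Made-up example: four rhyming words, one of them misheard.
presented = ["bat", "bad", "back", "bass"]
responses = ["bat", "bad", "bat", "bass"]
print(intelligibility(presented, responses))  # 75.0
```

The same function applies whether the units are words, sentences, or phonemes; only
the definition of a unit changes.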

O'Malley and Caisse (1987) point out the original MRT was never intended to be a
measure of human speakers' ability to produce intelligible speech but was developed
instead to measure transmission. Applied to synthetic speech, the MRT has several
serious deficiencies:

•  MRT  results  are  more  unstable  with  computer  speech  than  with  human  speech 
because  of  a  strong  learning  curve  (training  effect)  associated  with  listening  to 
synthetic  speech. 
•  The MRT sound list is too limited, testing only 300 monosyllables thus ignoring
vowel phonemes, some consonants and all consonant clusters.

•  The  MRT  only  tests  isolated  words  and  does  not  reflect  that  in  computer  speech, 
consonants  occur  next  to  silence  less  than  5%  of  the  time.  Except  for  menus  as 
used  in  this  study,  most  speech  occurs  in  sentences  (also  used  in  this  study),  and 
putting  words  together  is  the  most  difficult  task  for  phoneme-to-speech  modules. 

•  Few  MRT  tests  reported  so  far  have  been  conducted  in  a  telephone  environment 
with  its  accompanying  noise  and  bandwidth  limitations  thus  ignoring  telephone 
involvement  in  90%  of  computer  speech  applications. 

•  Vendors  attempt  to  tune  their  systems  to  the  300  words  found  in  the  MRT. 

Sentences also have their advantages and disadvantages when used in intelligibility
studies. Sentences are more appropriate for research purposes when used for evaluating
telephone information systems in which sentences are the usual unit of information of
interest. However, considerable differences in systems must exist before significant
differences will be obtained in transcription scores. Psychological factors (meaning,
context, rhythm) make sentence test scores difficult to analyze and interpret. For
extensive testing, a large number of sentences is required since the listener will
remember sentences. Furthermore, sentences used in actual, auditory displays tend to
be unique both in vernacular and context because of the particular application
setting. Consequently, researchers must employ systematic sentence construction
techniques in order to generalize results and attempt derivation of global principles
of sentence usage in synthetic speech displays.


Preference  Measures 

Preference measures have been either inferred from performance data or directly
measured using self-report measures such as subjective ratings and comparisons.
Listener impressions of naturalness, pleasantness, and acceptability as compared to a
human voice are the usual dimensions polled. Other dimensions such as confidence and
appropriateness are among many variations devised by researchers. Rating-scale types
have included Likert scales (one to seven numerical ratings), descriptively anchored
scales (“very human” as opposed to “very machine-like”), and bipolar scales (“harsh”
versus “soothing”). Open-set queries have no researcher-provided response to choose
from and though the data are less quantifiable, they often prove invaluable to the
researcher/designer. Because of their nonparametric qualities, subjective rating
methods are difficult to analyze with parametric statistics. There have been attempts
to relate subjective ratings to objective measures of speech intelligibility
(Barnwell, 1982; Voiers, 1977) and thus impart parametric attributes.

One such measure is the Diagnostic Rhyme Test (DRT) described by Voiers (1983).
Subjects compare relative intelligibility of 96 rhyming word pairs that differ by a
single acoustic feature or attribute in the initial consonant. The six attributes are:
voicing, nasality, sustention, sibilation, graveness, and compactness. Widely used
within the Department of Defense (DOD), the DRT has the advantage of providing highly
reliable and repeatable scores that can be used to make comparisons even among systems
evaluated at different times (Schmidt-Nielsen, 1985). However, potential users of
voice systems dislike the DRT because they lack a reference frame by which to evaluate
DRT scores. Instead, they prefer “realistic” tests despite the fact that such tests
are often unrepeatable because results are confounded by such irrelevant variables as
noise, distractions, and interruptions (Schmidt-Nielsen, 1985).

Pratt (1987) used another subjective or preference measure, Multi-Dimensional Scaling
(MDS), in which subjects rate the dissimilarity between members of a set of stimuli.
In this measure, subjects are presented with pairs of stimuli and instructed to assign
a numerical value to the degree of dissimilarity between members of each pair. Data
reduction techniques produce estimates of dissimilarity which the experimenter is then
required to interpret intuitively. Yet another preference measure is the Semantic
Differential Scaling (SDS) developed by Osgood, Suci, and Tannenbaum (1957). In the
SDS, subjects rate stimuli by selecting a point on a numbered scale which has been
anchored at either end with antonymous adjectives. This method is very similar to the
bipolar, seven-point scales used in this study.
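
As a minimal sketch of how ratings on bipolar, seven-point scales might be summarized,
the example below reports the median rating per scale. The median is chosen here
because ratings are ordinal; this is an assumption for illustration, not the analysis
used in this study. The scale labels and rating values are made up.

```python
from statistics import median


def summarize_ratings(ratings_by_scale, low=1, high=7):
    """Return the median rating for each bipolar scale.

    ratings_by_scale maps a scale label (for example "harsh/soothing")
    to the list of ratings collected from subjects on that scale.
    """
    summary = {}
    for scale, ratings in ratings_by_scale.items():
        if not ratings:
            raise ValueError(f"no ratings for scale {scale!r}")
        if any(r < low or r > high for r in ratings):
            raise ValueError(f"rating outside {low}-{high} on scale {scale!r}")
        summary[scale] = median(ratings)
    return summary


# Illustrative data only: five hypothetical subjects, two scales.
ratings = {
    "harsh/soothing": [3, 4, 5, 4, 2],
    "machine-like/human": [2, 2, 3, 1, 4],
}
print(summarize_ratings(ratings))
```

Each anchor pair is treated as its own scale, so a stimulus receives one summary value
per adjective pair rather than a single overall score.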


Selected  Independent  Variables 


Voice  Type 


Early speech synthesizers had one voice unique to the machine. Now, synthesizers are
capable of producing an almost endless variety of voices by manipulating adjustable
parameters. The DECtalk version 2.0 used in this study allows experimenter control
over 32 different parameters as well as possessing 9 default voices. Consequently,
intelligibility of different synthesizer voices as compared to each other has been a
natural research focus. Some speech synthesizers have achieved intelligibility rates
of 100% by careful manipulation of parameters and algorithms for certain words. Such a
file of “customized words” is called an exception dictionary. Indeed, in certain
conditions of noise or distractions, some subjects have rated synthetic speech more
intelligible than natural speech citing its distinctive qualities (Simpson, 1983;
Simpson and Williams, 1980). In a study designed to assess synthetic speech qualities,
Rosson and Cecala (1985) manipulated four parameters of head size, pitch, richness and
smoothness using sixteen perceptual-scale ratings to derive preference measures.
However, research involving methodical manipulation of individual synthetic speech
parameters to evaluate performance is still lacking.

Instead, most research has used the default voices of speech synthesizers. Greene, et
al. (1984) compared the DECtalk, version 1.8 to earlier evaluations of the Prose-2000,
version 8-84; the MITalk-79; and the Type-n-Talk, version 3-82. Using the open- and
closed-set Modified Rhyme Test, the Harvard Psycho-Acoustic Sentences, and the Haskins
Semantically Anomalous Sentences, they found the DECtalk unit the most intelligible
with error rates roughly half the size of error rates observed in earlier studies. Of
the two default DECtalk voices evaluated, Perfect Paul appeared more intelligible than
Beautiful Betty. Paul and Betty are male and female voices respectively, which
according to listeners, sound “middle-aged with an occasional accent.” A comparison
yet to be made and a focus of this study is whether these two most intelligible
voices, Paul and Betty, differ significantly in intelligibility for sentences as well
as isolated words and word units.

Speech  Rate 

Early research favored a speech rate of approximately 150 wpm. Simpson
and Marchionda-Frost (1984), using a Votrax ML-1 synthesizer, investigated three word
rates: 123, 156, and 178 wpm. Although they found intelligibility unaffected by speech
rate, subjects reported a subjective preference for 156 wpm. Lack of a performance effect
on intelligibility resulted from Simpson and Marchionda-Frost training their subjects
to 100% intelligibility on a small, highly constrained vocabulary, thus maximizing
contextual cues (Slowiaczek and Nusbaum, 1985). In a two-study series, Slowiaczek
and Nusbaum (1984) investigated the performance effects of 150 wpm and 250 wpm
on intelligibility using a Prose-2000 speech synthesizer. Their findings confirmed
Simpson and Marchionda-Frost's subject preference for 150 wpm. Waterworth and
Lo (1984), in investigating the effects of six rates (63, 82, 103, 121, 130, and 150 wpm),
found messages at the higher rates to be more intelligible, though no differences were
statistically significant. Their study compared natural voice to four synthesizers,
three of which were text-to-speech synthesizers: Votrax CDS-II, Prose-2000 and the
Microspeech-2.
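
To make the rates discussed above concrete, the following minimal sketch converts
a speech rate into presentation time. It assumes a hypothetical 10-word message and
ignores pauses and synthesizer overhead; the function name is illustrative and does
not come from any of the cited studies.

```python
def speaking_time_seconds(word_count, rate_wpm):
    """Return the seconds needed to speak word_count words at rate_wpm."""
    if rate_wpm <= 0:
        raise ValueError("rate_wpm must be positive")
    return 60.0 * word_count / rate_wpm


# Rates are those cited above; the 10-word message length is an assumption.
for rate in (123, 150, 156, 178, 250):
    print(f"{rate} wpm: {speaking_time_seconds(10, rate):.2f} s")
```

Under these assumptions, moving from 150 wpm to 250 wpm shortens a 10-word message
from 4.0 to 2.4 seconds, which is the kind of saving that matters when users must
listen to every menu item in sequence.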

Recent research findings, however, indicate an optimum speech rate of 180
words per minute (wpm) for synthetic speech, a rate which approximates the average
for conversational speech. This optimum was for speech produced by the DECtalk
synthesizer's Perfect Paul voice (Merva and Williges, 1986; Merva, 1987). In one
study (Merva and Williges, 1986), a rate of 250 wpm was shown to be significantly less
intelligible than a 180 wpm rate. In a follow-on study, Merva (1987) compared three
speech rates of 150 wpm (the preferred rate reported by Simpson and
Marchionda-Frost, 1984), 180 wpm, and 210 wpm and again found performance measures
indicating 180 wpm as the optimum rate. Both studies, however, used sentences as the
audible targets. Sentences provide more linguistic, contextual clues than single
words (Simpson and Williams, 1975), but single words or small phrases are necessary
for menu selection choices in auditory databases. Further investigation of relatively
high speech rates may enable increases in auditory display rates, allowing
users to scan messages more quickly (O'Malley and Caisse, 1987). Also, no study
has systematically investigated the possible interaction of voice type and speech rate
on intelligibility. This study addressed all of these issues.

Information Coding


The issues investigated in this study pertain not only to the intelligibility of
synthetic speech but to principles of auditory displays as well. Most human factors
research efforts in synthetic speech attempt to refine and expand guidelines for
display design and implementation. The starting point for many has been Deatherage's
(1972) comparison table for auditory and visual display forms. Though quantitative
guidelines derived from research findings are still forthcoming, designers at least can
remain aware of problems, especially those revealed in information processing
studies. As an example, Kidd (1982) provides several problems pertinent to auditory
displays:

•  A  user's  short  term  memory  storage  capacity  is  severely  limited  with  any  new 
input  decaying  rapidly  unless  constantly  rehearsed. 

•  Any problem solving, decision making or other information processing severely
restricts the user's ability to carry out the necessary rehearsal of new
information.

•  Synthetic  speech  (currently)  requires  more  effort  to  process  than  human  speech. 

•  The  user  cannot  control  the  rate  at  which  information  is  received. 

•  The  user  is  unable  to  rapidly  scan  the  menu  list  in  search  of  a  target  item  and 
instead  must  hear  each  item  individually. 

•  Possible  user  anxiety  may  result  from  not  knowing  how  many  menu  items  will 
have  to  be  remembered  during  an  interaction. 

Sanders  and  McCormick  (1987)  do  offer  tentative  guidelines  for  synthetic 
speech  display  implementation  (see  Table  1  on  page  20)  “gleaned”  from  these 
sources:  Simpson  and  Williams,  1980;  Thomas,  Rosson,  and  Chodorow,  1984;  and 
Wheale,  1980. 

System designers should also attempt to take advantage of chunking while
remembering the limited STM capacity by providing clues about the classification
structure. This feature enables users to recognize correct options the first time they
are heard and should be optional for experienced users. A form of chunking found
effective is insertion of pauses at (grammatically) appropriate points. Nooteboom
(1983) used pauses in this manner to improve performance with synthetic speech to
a level virtually identical with that of natural speech. Waterworth (1983) demonstrated
a similar improvement from inserting pauses in a study where subjects recalled
automatically generated telephone numbers.

Guidelines provided in Table 1 on page 20 exemplify qualitative guidance
provided in current literature. Few, if any, collections of quantitative standards can
be found. McKinley, Anderson and Moore (1982) provided an exception by specifying
two performance levels used as criteria by the Air Force Aerospace Medical
Research Laboratory to evaluate synthetic speech system prototypes. Those criteria
require a Modified Rhyme Test score of 80% correct or better and a reaction time of
250 milliseconds (msec) or less. Reaction time used in their criteria measured time
from the end of the speech presentation until subject response. This differs from the
system response time measure used in this study. However, commercial
applications with accuracy ratings of 80% would experience little success.

Database  Organization 

Despite the large amount of research on optimum menu configurations for
visual databases, very little information exists for audible databases. Of the many
issues to be resolved in audible databases, perhaps the main issue is the one of
organization. Short-term memory and information recall make menu breadth and
depth crucial to the display designer. Breadth is the number of choices at each menu
level and depth is the number of menu levels. A 2x6 database like the one used in
this study has 2 choices at each level with 6 levels. Snowberry, Parkinson, and
Sisson (1983) found subjects performed poorly in searches using 2x6 visual
databases and postulated three reasons. First, subjects might have forgotten the target.
To counter this factor, Snowberry et al. recommend continuous display of the target,
a feature of this study's design. Second, subjects may forget the pathway to the
target. Since this study assumed infrequent users, the database was designed to make
learning a pathway unnecessary. Finally, instead of associating a target with a path
of options (the intended searching strategy of Snowberry et al.), subjects tended to
base selections of options on perceived associations between displayed items and
the target. This last explanation posed no problem for this study since an association
between menu items and targets was the intended searching strategy for the
database.

An additional searching or navigational aid evaluated in this study was use of
two voices to speak menus in an alternating fashion. It was thought that use of
alternating male and female voices would enable a subject to distinguish different levels
of the database better and, consequently, perform a more efficient (faster) search.
Additionally, this voice coding scheme would assist the subject in tracking the depth
of menu level progression. Kidd (1982) recommended use of auditory cues such as
tones or different voices for just these reasons.

Method


Experimental  Design 


The experimental design consisted of a 2x2x2 between-subjects factorial
design. This design, as shown in Figure 3 on page 24, contains three independent
variables: voice type, coding scheme, and speech rate.


Voice  Type  and  Coding  Scheme 


Voice type and coding scheme were fixed-effects, between-subjects variables.
Two levels of each variable were fully crossed to create four conditions of voice type
and coding scheme. DECtalk's Perfect Paul voice represented the male voice
and Beautiful Betty, the female voice. In half of the conditions, either the male or the
female voice was used as the sole voice in the synthetic speech display. The
remaining conditions employed alternating voices as the subject progressed through
the database levels. In one condition, the female voice began by pronouncing the
main menu options followed by the male voice pronouncing the next menu level.
This alternating female/male voice pattern continued to the final database level
where the information message, a complete sentence, was spoken by the beginning
voice, in this case the female voice. The other condition of alternating voices
began and ended with the male voice. This ensured target or information messages in
the bottom database level were spoken by both voice types, one in each alternating
voice scheme.

Figure 3. In each condition, 4 subjects searched for 16 targets.
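
The alternating scheme described above can be sketched as a simple parity rule.
This is an illustrative sketch only: it assumes the six menu levels are numbered
1 to 6 and treats the information message as a seventh step, which is one reading
of the description that makes the message fall on the beginning voice. The
function and constant names are hypothetical.

```python
MENU_LEVELS = 6                   # menu levels in the 2x6 hierarchy
MESSAGE_LEVEL = MENU_LEVELS + 1   # assumption: the message is the step after level 6


def voice_for_level(level, starting_voice, alternating):
    """Return which voice speaks at a given level (1 = main menu)."""
    if starting_voice not in ("male", "female"):
        raise ValueError("starting_voice must be 'male' or 'female'")
    if level < 1:
        raise ValueError("level must be 1 or greater")
    # Single-voice conditions, and odd-numbered levels, use the starting voice.
    if not alternating or level % 2 == 1:
        return starting_voice
    return "male" if starting_voice == "female" else "female"


sequence = [voice_for_level(n, "female", True) for n in range(1, MESSAGE_LEVEL + 1)]
print(sequence)
```

With a female starting voice the sequence begins and ends on the female voice,
matching the condition described in the text; swapping the starting voice gives
the complementary condition.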

Speech  Rate 

Speech rate was a fixed-effects, between-subjects variable. Two levels of this
variable were investigated: 180 words per minute (wpm) and 240 wpm. Speech rate
affected both keywords and information messages, which were complete sentences
(subject-verb-object). Speech rate was fully crossed with the four conditions of voice
type and coding scheme to create the eight treatment combinations depicted in
Table 2 on page 26.
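
As a rough illustration of how the eight treatment combinations and 32 subject
slots arise from the crossing, the sketch below enumerates them in the order used
in Table 2. The label convention for the alternating conditions ("Male/Female",
"Female/Male") follows the table; the function name and the sequential numbering
logic are assumptions made for illustration.

```python
from itertools import product

CODING_SCHEMES = ("Same", "Alternating")
VOICE_TYPES = ("Male", "Female")
SPEECH_RATES = (180, 240)          # words per minute
SUBJECTS_PER_CONDITION = 4


def build_assignments():
    """Return (condition, treatment, voice label, coding scheme, rate) rows."""
    rows = []
    treatment = 1
    crossing = product(CODING_SCHEMES, VOICE_TYPES, SPEECH_RATES)
    for condition, (coding, voice, rate) in enumerate(crossing, start=1):
        if coding == "Alternating":
            # Alternating conditions are labeled by the order of the two voices.
            label = "Male/Female" if voice == "Male" else "Female/Male"
        else:
            label = voice
        for _ in range(SUBJECTS_PER_CONDITION):
            rows.append((condition, treatment, label, coding, rate))
            treatment += 1
    return rows


for row in build_assignments():
    print(row)
```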


Subjects 


This study employed 4 subjects in each of 8 treatment combinations of voice
type, coding scheme, and speech rate, yielding a total of 32 subjects. Volunteers
from the university community were provided monetary compensation for their
participation. Average age was 19.9 years with a range from 18 to 27.


Table 2. List of Experimental Conditions for 32 Subjects

Condition   Treatment   Voice         Coding        Speech
Number      Number      Type          Scheme        Rate
---------   ---------   -----------   -----------   ------
1            1          Male          Same          180
             2          Male          Same          180
             3          Male          Same          180
             4          Male          Same          180
2            5          Male          Same          240
             6          Male          Same          240
             7          Male          Same          240
             8          Male          Same          240
3            9          Female        Same          180
            10          Female        Same          180
            11          Female        Same          180
            12          Female        Same          180
4           13          Female        Same          240
            14          Female        Same          240
            15          Female        Same          240
            16          Female        Same          240
5           17          Male/Female   Alternating   180
            18          Male/Female   Alternating   180
            19          Male/Female   Alternating   180
            20          Male/Female   Alternating   180
6           21          Male/Female   Alternating   240
            22          Male/Female   Alternating   240
            23          Male/Female   Alternating   240
            24          Male/Female   Alternating   240
7           25          Female/Male   Alternating   180
            26          Female/Male   Alternating   180
            27          Female/Male   Alternating   180
            28          Female/Male   Alternating   180
8           29          Female/Male   Alternating   240
            30          Female/Male   Alternating   240
            31          Female/Male   Alternating   240
            32          Female/Male   Alternating   240


Experimental  Apparatus 


A Beltone 109 Audiometer was used to assess subjects' gross hearing
abilities. For the experimental task, Digital Equipment Corporation's (DEC) DECtalk
speech synthesizer provided the speech display. Task presentation and data
recordings were executed by a VAX 11/750 mainframe system connected to two DEC
VT220 terminals using a specially developed PASCAL program. The experimenter
station used one VT220 terminal (visual display unit with separate keyboard) to
initialize and monitor each session. The subject's station also used a VT220 terminal
coupled with a touch-tone speaker phone (Panasonic VA-8205). The telephone's
speaker (not the handset) presented the speech display. The volume control was
taped over to provide a constant volume level for all subjects. A JVC GX-S700 video
camera provided visual and aural monitoring of subjects to video monitors located
at the experimenter's station in an adjacent room. Audio or video recordings of
experimental sessions were not made.


Information  Database 


Organization  and  Keywords 

The database constructed for this study contained information about typical
department store items. The database was a 2x6 hierarchy containing 6 levels of
menus with each menu having 2 items (see Figure 4 on page 29 and Figure 5 on
page 30). Each menu item or keyword served as a title for a group of related items
(e.g., “entertainment” is a keyword for “music” and “books”). Keywords were
selected to allow grouping of store items into sets of 2, 4, 8, 16, 32, and 64 keywords
for each menu level.
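
The keyword counts per menu level follow directly from the breadth and depth of
the hierarchy. The short sketch below is illustrative only (the function name is
hypothetical) and shows the arithmetic for a database with 2 choices per menu and
6 levels.

```python
def keywords_per_level(breadth, depth):
    """Number of keywords at each menu level of a breadth x depth hierarchy."""
    return [breadth ** level for level in range(1, depth + 1)]


levels = keywords_per_level(2, 6)
print(levels)        # [2, 4, 8, 16, 32, 64]
print(sum(levels))   # 126 keywords across all six levels
```

Because each selection halves the remaining set, a subject who makes no errors
needs exactly six selections (one per level) to reach one of the 64 bottom-level
store items.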

Preliminary study efforts attempted to ensure sets of store items were
reasonably distinct from each other to reduce searching errors due to semantics or
ambiguous keywords. Keywords found by the preliminary study to be grossly
unintelligible or consistently misconstrued were discarded and replaced with
synonyms or similar items. Manual phoneme or stress polishing was not done to enhance
DECtalk pronunciation. However, compound words were entered in an exception
dictionary with hyphens at the appropriate location to reduce mispronunciation (e.g.,
basket-ball, sweat-pants). Finally, contextual clues were provided by the department
store scenario to help subjects recognize keywords in both menu levels and
information messages.


Figure 4. Diagram of the Household-half of the 2x6 hierarchical database.
[Tree diagram. Keywords shown include: Household, Furnishings, Furniture, Living
Room, Chair, Sofa, Bedroom, Beds, Laundry, Major, Appliances, Kitchen,
Communications, Small, Food, Stereo, Equipment, Music, Entertainment, Recordings,
Strings, Instruments, Brass, Desserts, Cooking, Books, Meat, Team, Sports,
Individual, Recliner, Bean Bag, Sectional, Love Seat, Bunk, Water, Hope, Dresser,
Washer, Dryer, Stove, Refrigerator, Telephone, Answer. Mach., Blender, Toaster,
Turntable, Trumpet, Candy.]

Figure 5. Diagram of the Fashion-half of the 2x6 hierarchical database.
[Tree diagram. Keywords shown include: Fashion, Clothing, Men's, Business, Winter,
Summer, Jackets, Herring Bone, Tweed, Sportswear, Tops, Sweaters, Jerseys, Pants,
Jeans, Sweatpants, Women's, Dresses, Formal, Wedding, Cocktail, Casual, Knit,
Denim, Separates, Blouses, Silk, Cotton, Skirts, Pleated, Straight, Accessories,
Jewelry, Metal, Watches, Gold, Chains, Cufflinks, Silver, Broaches, Pins, Gems,
Rings, Diamond, Earrings, Pearl, Necklaces, Cosmetics, Makeup, Face, Eye Shadow,
Mascara, Hand, Cream, Nail Polish, Fragrances, Masculine, Musk, Spice, Feminine,
Floral, Oriental.]


Information Messages


Each of 64 bottom-level keywords functioned as a title for an information
message. The messages were of four types: Location, Price, Availability, or
Information. Each message had the form of adjective-noun-verb-preposition-object (e.g.,
“Silk blouses are sold for half-price.”). As shown in Table 3 on page 32, using a
restricted set of verbs and prepositions and non-varying sentence construction
standardized the information message format. This standard format made the middle
section of each message familiar to the subject and reduced linguistic context
clues as to the meaning of the message. Consequently, the first and last two words
in each message could be scored both collectively and separately for transcription
accuracy (Merva, 1987). Other guidelines used to construct information sentences
are provided in Table 4 on page 33.
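
One way to picture the scoring of the first and last two words is the minimal
sketch below. It is not the study's scoring procedure: the tokenization (lowercase,
punctuation stripped, hyphens treated as word breaks) and the position-based
matching are assumptions made for illustration, and the function names are
hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into words; hyphens act as separators."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def score_transcription(target, response):
    """Count correct words among the first two and last two words of target."""
    target_words = tokenize(target)
    response_words = tokenize(response)
    first = sum(t == r for t, r in zip(target_words[:2], response_words[:2]))
    # Align the final words from the end so a short response is not shifted.
    last = sum(
        t == r
        for t, r in zip(reversed(target_words[-2:]), reversed(response_words[-2:]))
    )
    return {"first": first, "last": last, "total": first + last}


print(score_transcription("Silk blouses are sold for half-price.",
                          "Milk blouses are sold at half price"))
```

Keeping the "first" and "last" counts separate mirrors the idea that the two ends
of the message can be scored both collectively and separately, while the
standardized middle section is left unscored.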

Experimental Protocol


Preliminaries 


The experimental session began with each subject reading and signing the
informed consent form (see Appendix B). Consenting subjects then completed a
demographic survey form (see Appendix C). Next, the experimenter administered a
hearing test to each subject to eliminate data from “hard of hearing” subjects
(American National Standards Institute, 1973). Hearing test criterion was the hearing
of two out of three pulsed tones at 26 dB between 750 and 4000 hertz (Hz). Subjects
unable to pass the test were still allowed to participate, but their data were discarded.
This occurred for one subject in this study. After the hearing test, the experimenter
used the speakerphone's auto-dial feature to call the department store information
system. The synthesizer spoke an introduction and instructions as the subject read
along using a written guide (see Appendices D and E). The voice spoke at either 180
wpm or 240 wpm, reflecting the subject's assigned treatment condition. The
synthesizer used the dominant voice for the condition experienced by the subject.
For conditions with one voice, the dominant voice was the same voice as heard by
the subject in experimental trials. In conditions employing an alternating voice
coding scheme, the dominant voice was the voice that spoke the main (or first) menu
level and the information message. When the subject completed reading and
listening to the instructions, the experimenter played a video tape which repeated the
instructions and demonstrated a target search through the database.


Table 3. Information Messages Format

Information Type   Format
----------------   ---------------------------------------------------------------
LOCATION:          Adjective subject   is/are             near / in / on        object
PRICE:             Adjective subject   is/are reduced     for / by              object
                   Adjective subject   is/are sold        for / by              object
AVAILABILITY:      Adjective subject   is/are available   with / at / by / in   object
INFORMATION:       Adjective subject   is/are offered     with / for / to       object
                   Adjective subject   is/are required    within / for          object


Table 4. Rules Used for Developing Information Sentences

1.  Each information message was a single sentence.

2.  Standard syntax was used for each sentence.

3.  Cliches, proverbs, and other stereotyped constructions were avoided.

4.  Four message types of information, location, availability, and price
were required.

5.  Only four words were scored in each sentence.

6.  Scored words were never duplicated in any other information message.

7.  No proper nouns were allowed as scored words.

Following the instruction tape, the experimenter answered questions and
emphasized any differences between the demonstration and conditions the subject was
to experience. The experimenter then depressed the space bar on the subject's
keyboard, causing the synthesizer to review keypad functions available to the subject
(see Appendix F). Again, the synthesizer used the dominant voice at the subject's
assigned rate.

Experimental  Session 

The subject then began a practice series of two trials by using the
speakerphone to call the department store information system as done earlier. The
system “answered” using the dominant synthetic voice to offer a brief review of task
instructions. Following this review, or a four-second timeout if instructions were not
selected by the subject, the first practice target was displayed on the computer
terminal's display screen for 15 seconds. The first sample target message read, “What
is the information about golf books?” At the end of the 15-second display, a
“ready...” message displayed on the screen below the target indicated the search
was about to begin. The target was displayed on the computer screen throughout the
target search. Two seconds after the ready message, a “Begin the search” message
was displayed on the screen and the information system spoke the first-level menu.

When the subject heard a keyword relating to the target, that keyword was
selected by pressing the  key on the telephone keypad. The system then
responded by speaking the next lower menu level of keywords related to the keyword
previously selected. If subjects wanted to back up a menu level, they pressed the
key. To return to the main menu, subjects used the “0” key. In this fashion,
subjects navigated through the audible database until finding the store item displayed
in the target message on the display screen. If the subject arrived at an incorrect

store  item,  the  system  would  speak,  "At  store  item, _ ;  continue  search.”  To 

continue  the  search,  subjects  depressed  the  or  “0”  key. 
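
The navigation behavior described above (select a keyword, back up one level,
return to the main menu) can be sketched as a small state tracker. Everything
specific in this sketch is an assumption made for illustration: the key
assignments, the use of numbered choices rather than a press-when-heard selection,
and the tiny keyword fragment are hypothetical and are not the study's keypad
layout or database.

```python
SELECT_KEYS = ("1", "2")  # hypothetical: choose the first or second keyword
BACK_KEY = "*"            # hypothetical: back up one menu level
MAIN_KEY = "0"            # hypothetical: return to the main menu

# Hypothetical fragment of a keyword hierarchy; None marks a bottom-level item.
TREE = {
    "Household": {
        "Furniture": None,
        "Entertainment": {"Music": None, "Books": None},
    },
    "Fashion": {"Clothing": None, "Accessories": None},
}


class MenuNavigator:
    """Track a user's position in a hierarchical keyword menu."""

    def __init__(self, tree):
        self.tree = tree
        self.path = []  # keywords selected so far, from the main menu down

    def current_menu(self):
        """Return the keywords offered at the current position."""
        node = self.tree
        for keyword in self.path:
            node = node[keyword]
        return list(node) if node else []

    def press(self, key):
        """Apply one keypress and return the menu that would be spoken next."""
        menu = self.current_menu()
        if key == MAIN_KEY:
            self.path = []
        elif key == BACK_KEY:
            if self.path:
                self.path.pop()
        elif key in SELECT_KEYS:
            index = SELECT_KEYS.index(key)
            if index < len(menu):
                self.path.append(menu[index])
        return self.current_menu()


nav = MenuNavigator(TREE)
print(nav.press("1"))  # descend into the first main-menu keyword
print(nav.press("2"))  # descend into its second keyword
print(nav.press("*"))  # back up one level
print(nav.press("0"))  # return to the main menu
```

Recording the sequence of keys passed to `press` is also a natural place to
collect search measures such as the number of keypresses per target.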

Upon subject selection of a correct, bottom-level item, the information system
requested subjects to depress the “2” key to hear the information message related
to the store item. After the system spoke the information message, the computer screen
displayed a message requesting the subject to transcribe the information message
just heard. This request replaced the target message displayed during the search.
There was no time limit for the transcription task, with subjects encouraged to
transcribe their best guess if unsure of their answer. After the answer was typed in, a
series of three computer-displayed messages prompted subjects for subjective ratings
(see Appendix H). The first asked subjects to rate the certainty of their transcription on a
scale of 1 (very uncertain) to 7 (very certain). A second bi-polar adjective scale
followed the first and asked subjects to rate the difficulty in understanding the message.
Again the scale was from 1 (very difficult) to 7 (very easy). Finally, subjects rated
difficulty in locating the store item on a scale of 1 (very difficult) to 7 (very easy).

After subjects completed the third rating, a second practice target appeared
on the computer screen and, as before, the system began the search fifteen seconds
later by speaking the first menu level. Following this second search, the system hung up and
the  experimenter  offered  subjects  a  rest  period.  Following  the  rest  period,  subjects 
began  the  main  experimental  session  by  calling  the  information  system  as  they  had 
done  for  the  two  practice  searches.  Searches  proceeded  in  the  same  manner  as 
practice  trials  until  the  subject  found  eight  targets.  After  completing  the  third  seven- 
point  scale  rating  for  the  eighth  target,  a  “TAKE  A  BREAK!”  message  appeared  for 
one  minute  before  another  message  appeared  instructing  subjects  to  press  the 
spacebar  to  continue.  Following  the  break,  subjects  completed  the  remaining  eight 
target  searches. 

After  completion  of  16  trials,  subjects  used  the  computer  terminal  to  answer 
7  additional  questions  about  the  telephone  information  system  in  the  form  of  seven- 
point  ratings  (see  Appendix  H).  Then  the  experimenter  conducted  a  structured 
interview  of  17  to  21  questions  concerning  subject  impressions  of  the  synthetic 
voice(s)  used  in  the  display  and  the  display  application  in  general  (see  Appendix  I). 
Subjects  receiving  an  alternating  voice  condition  were  asked  four  questions  more  (21 
total) concerning differences between the two voices used in the display. Finally,
each subject was debriefed on the experiment's purpose, paid, and thanked for their
participation. Figure 6 on page 38 illustrates the major portions of each experimental
session with average times shown for each portion. Total time for the experimental
session ranged from one hour, fifteen minutes to one hour, forty-five minutes, with the
average session time per subject lasting approximately one hour, thirty minutes.


Dependent  Measures  and  Data  Collection 


The experimental task as experienced by a subject was actually two tasks in
series: a search task of finding a correct store item followed by a message
transcription task of typing the information message into the computer. If a subject
arrived at an incorrect store item, the message, "At store item ____; continue
search.", prompted the subject to continue the search until reaching the correct item.
Consequently, since searches for a specific store item by all subjects eventually
ended at the same store item, this allowed direct comparison of search task measures
among subjects. Likewise, since all subjects heard the same 16 information
messages, intelligibility scores could be directly compared as well.

All measures were in the form of keystrokes on the VT220 terminal keyboard
or keypresses on the telephone keypad. Both keystrokes and keypresses were
recorded by a metering package in the software program for the experimental session.
Below are 4 objective (performance) and 10 subjective (preference) measures used
to assess effects of the independent variables.


Objective Measures

•  target search time ratio

•  target search efficiency ratio

•  invalid keypresses

•  message transcription errors (strict and synonym)

Subjective Measures

After each target search:

•  message transcription certainty rating

•  message understanding difficulty rating

•  search difficulty rating

After completion of all 16 target searches:

•  system ease of use

•  voice intelligibility

•  voice naturalness

•  voice speech rate

•  system response time

•  system input timeout

•  menu organization


WELCOME AND ORIENTATION (~15 mins)
    Informed Consent
    Subject Information Questionnaire
    Hearing Test

INSTRUCTIONS AND PRACTICE (~20 mins)
    Introduction (audio - written)
    Instructions (audio - written)
    Video Instructions
    Telephone Key Instructions (audio - written)
    Subject Recapitulation of Instructions
    Practice Targets (n = 2)

EXPERIMENTAL TASK (~30 mins)
    8 Experimental Targets
        Target Search
        Transcription
        Target Ratings
    Break (minimum 1 minute)
    8 Experimental Targets
        Target Search
        Transcription
        Target Ratings
    Post Experimental Ratings

POST EXPERIMENTAL SESSION (~15 mins)
    Debriefing
    Payment and Dismissal

Figure 6. Outline of Experimental Session Events.

Search  Task  Measures 

Target search time ratio is an average ratio score of a subject's total search
time compared to the minimum search time taken by an expert user. A search time
ratio of 1.0 would indicate an “expert” performance by a subject. Expert search time
was determined by running a real-time computer simulation of expert searches under
conditions experienced by subjects. Each simulation run score was a combination
of system time requirements and 0.57 seconds for each menu level selection. This
selection time was taken from the American Institutes for Research Data Store
(Munger, Smith, and Payne, 1962) for an expert user pressing a pushbutton when
cued. Every time a selection was required in a simulation run, this value was used.
System time requirement included three values: system response times to user
inputs (set at 0 seconds for all 8 treatment conditions), system timeouts or the amount
of time provided to users for keypad input (set at 4 seconds for all 8 treatment
conditions), and the minimum amount of time the system required to speak the necessary
menu items.

However, despite setting the input timeout parameter at 4 seconds, the actual
timeout varied by as much as ±0.5 seconds. This variability was a function of system
software. System speech, the third facet of system time requirement, also varied as
a function of speech rate and voice type. Because of these small variabilities in
DECtalk system time requirements and system response times, average expert
scores were obtained. As in the overall experimental design, four real-time
simulation runs per condition were conducted to achieve an average expert score for a
particular condition. The average expert search time score for each condition was then
combined with each subject's average search time in the same condition to form the
search time ratio score for that subject.
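
As a rough illustration of the procedure described above, the following Python sketch averages several simulated expert run times and forms a search time ratio for one subject. The direction of the ratio (expert time divided by subject time) is an assumption inferred from the reported scores, which fall below 1.0. The function names and every timing value other than the 0.57-second selection time are hypothetical.

```python
from statistics import mean

# Selection time per menu-level choice (Munger, Smith, and Payne, 1962).
SELECTION_TIME_S = 0.57


def expert_run_time(system_time_s, n_selections):
    """Time for one simulated expert run: system time plus selection time."""
    return system_time_s + SELECTION_TIME_S * n_selections


def search_time_ratio(expert_run_times_s, subject_time_s):
    """Average expert time divided by the subject's time (assumed direction)."""
    return mean(expert_run_times_s) / subject_time_s


# Hypothetical data: four simulation runs for one condition, 3 selections each.
runs = [expert_run_time(t, 3) for t in (40.2, 40.6, 39.9, 40.4)]
ratio = search_time_ratio(runs, subject_time_s=63.0)
print(round(ratio, 3))
```

Averaging over four runs mirrors the text's handling of timeout and speech-duration variability; a ratio of 1.0 would correspond to a subject matching the simulated expert.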

Target search efficiency ratio is a score of subject search efficiency formed
by the ratio of the minimum number of keywords required to be heard in order to reach
a store item to the actual number of keywords heard by a subject. As shown in
Table 5 on page 42, target store items were symmetrically distributed among number
of keywords required. The total number of keywords each subject heard for all 16
searches was combined with the minimum number of keywords required for all 16
searches. As in target search time ratios, a target search efficiency ratio score of 1.0
would indicate perfect performance by a subject.
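
The efficiency ratio can be sketched in a few lines of Python. The distribution below is taken from Table 5; the function names and the example subject total of 180 keywords are hypothetical and serve only to show the arithmetic.

```python
# Table 5: minimum keywords required -> number of target store items.
MIN_KEYWORDS = {6: 1, 7: 1, 8: 3, 9: 6, 10: 3, 11: 1, 12: 1}


def total_minimum_keywords(distribution):
    """Sum of minimum keywords over all target store items."""
    return sum(keywords * items for keywords, items in distribution.items())


def search_efficiency_ratio(keywords_heard_total, distribution=MIN_KEYWORDS):
    """Minimum keywords over all searches divided by keywords actually heard."""
    return total_minimum_keywords(distribution) / keywords_heard_total


print(total_minimum_keywords(MIN_KEYWORDS))    # 144 keywords over 16 targets
print(round(search_efficiency_ratio(180), 2))  # hypothetical subject -> 0.8
```

Because the numerator is fixed at 144 for every subject, differences in the ratio reflect only how many extra keywords a subject listened to while navigating.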

Table 5. Minimum Number of Keywords Required

Keywords Heard    Number of Target Store Items

       6                      1
       7                      1
       8                      3
       9                      6
      10                      3
      11                      1
      12                      1


Invalid keypresses are keypresses inappropriate at the time of occurrence.
Either the key is not defined or a defined key is depressed at an inappropriate time,
such as depressing the “2” key before reaching an information message. The
measure used in this study was the average number of invalid keypresses per
search.

Transcription Task Measures

Message transcription errors as calculated in this study is a measure based
on a design used by Merva and Williges (1987) to investigate effects of speech rate,
message repetition, and information placement on synthesized speech intelligibility.
In their scheme, the first two and last two words of each transcription are checked
for accuracy. One point is given for each correct word. Under “strict” scoring, words
in the response must be exactly the same as words in the spoken message to be
counted as correct. Spelling errors were not counted as incorrect as long as the
word remained phonetically correct. “Synonym” scoring allows synonyms for the
spoken words to be accepted as correct. Subject responses in this study were
scored under both rules. Synonym scoring allows for the variability in human
assimilation of spoken words. If a subject transcribes the word “luggage” for the
spoken word “baggage”, one cannot determine if this is due solely to intelligibility
or includes assimilation effects of comprehension. Synonym scoring effects a
compromise for this dilemma by allowing for contextually correct answers.
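
The difference between strict and synonym scoring can be illustrated with a minimal Python sketch. It is not the scoring procedure actually used in the study: the position-by-position comparison, the function name, the synonym table, and the example message are all assumptions, and the tolerance for phonetically correct misspellings is not modeled.

```python
def score_transcription(spoken, response, synonyms=None):
    """Count errors on the first two and last two words of a message.

    `synonyms` maps a spoken word to a set of acceptable alternatives;
    pass None for strict scoring.
    """
    synonyms = synonyms or {}
    spoken_words = spoken.lower().split()
    response_words = response.lower().split()
    errors = 0
    for idx in (0, 1, -2, -1):  # first two and last two word positions
        target = spoken_words[idx]
        try:
            heard = response_words[idx]
        except IndexError:
            heard = None  # response too short: scored as an error
        if heard != target and heard not in synonyms.get(target, set()):
            errors += 1
    return errors


# Hypothetical message and response, echoing the luggage/baggage example.
spoken = "the baggage department sells leather suitcases"
typed = "the luggage department sells leather suitcases"
print(score_transcription(spoken, typed))                              # strict
print(score_transcription(spoken, typed, {"baggage": {"luggage"}}))    # synonym
```

Under these assumptions the same response yields one error under strict scoring and none under synonym scoring, which is the pattern the two rules are designed to separate.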

Hypotheses 

The general null hypotheses were that different levels of each independent
variable or any combination of independent variables would have no effect on the value
of any dependent measure. Alternative hypotheses contended an effect but did not
suggest a direction. Analysis questions posed by alternative hypotheses are stated
in Table 6 on page 44 and Table 7 on page 46.


Table 6. Main Analysis Questions

Cell             Task
Comparison       Measures  Question                            Implication of Significance

Voice (V)        TT        Do scores vary between voices?      Basic intelligibility

Coding (C)       ST        Do scores improve with measure of   Efficacy of navigation aids
                           information coding?

                 TT        Do scores improve with measure of   Possible practice effect
                           practice?                           if C1 > C2

Speech Rate (R)  ST        Are search scores less with faster  Rate effects on search task
                           rates?                              performance

                 TT        Are less errors made at lower       Rate effects on overall
                           rates?                              intelligibility

V * C            ST        Do scores vary among combinations   Search efficacy of different
                           of voice type and coding schemes?   combinations

                 TT        Do scores vary among combinations   Effects of practice by same
                           of voice type and coding schemes?   or different voices

V * R            TT        Do scores vary among combinations   Differential intelligibility
                           of voice type and speech rate?      as affected by rate

                 ST        Do scores vary among combinations   Search efficacy of different
                           of voice type and speech rate?      combinations

C * R            ST        Do scores vary among combinations   Search efficacy of different
                           of coding scheme and speech rate?   combinations

                 TT        Do scores vary among combinations   Differential effects of rate
                           of coding scheme and speech rate?   on practice

V * C * R        ST        Do scores vary among combinations   Search efficacy of unique
                           of voice type, coding scheme, and   combinations
                           speech rate?

                 TT        Do scores vary among combinations   Practice and intelligibility
                           of voice type, coding scheme, and   of unique combinations
                           speech rate?

Note: TT = Transcription Task Scores; ST = Search Task Scores
Table 7. Post Hoc Analysis Questions

Cell                Task
Comparison          Measures   Question                             Implication of Significance

Are < p V < o       TT         Do scores for one condition reflect  Practice effect of same
and                            better performance than another?     voice
v2c, > VA

If V,R, = V2R,      TT and ST  Do scores for one condition reflect  Differential intelligibility
or VA > VA                     better performance than another?     of voices as affected by
and                                                                 rate
VA > v2r2
or
VA < VA

If C,R, = CA        ST         Do scores for one combination of     Differential effect of rate
and                            coding scheme and speech rate        on search efficacy (assuming
CA > CA                        reflect better performance than      intelligibility is equal)
or                             another?
CA < CA

Same analysis with corresponding comparisons assuming CA - CA

V * C * R           ST         Are one or more combinations of      Search efficacy of unique
                               voice type, coding scheme, and       combinations
                               speech rate better than others?

                    TT         Are one or more combinations of      Practice and intelligibility
                               voice type, coding scheme, and       of unique combinations
                               speech rate better than others?


Results 


Both  performance  data  from  search  and  transcription  tasks  and  preference 
data  from  post-search  and  post-session  ratings  were  analyzed  using  descriptive  and 
inferential  statistics  with  data  analysis  results  of  p  <  0.05  considered  significant. 
Dependent  measures  and  data  collection  procedures  are  detailed  in  the  Methods 
Section.  Computer  files  of  subject  data  with  manually  inserted  transcription  scores 
(strict  and  synonym  scored)  were  input  to  a  data  reduction  package  with  reduced  data 
results  provided  in  Appendix  J.  Statistical  data  analysis  was  done  with  the  IBM  370 
mainframe  computer  at  Virginia  Tech  using  the  Statistical  Analysis  System  (SAS, 
1986). 

Search  Task  Data  Analysis 

A three-way multivariate analysis of variance (MANOVA) for factors of Voice
Type, Coding Scheme and Speech Rate was performed for dependent measures of
transcription errors (strict and synonym scored), target search time ratios, target
search efficiency ratios, and invalid keypresses with results shown in Table 11 on
page 52. Conversion of Wilks' U criterion to familiar F values was used (SAS, 1982)
for evaluating overall significance of effects. Means for search task dependent
measures categorized by each independent variable are shown in Table 8 on page
49, Table 9 on page 50, and Table 10 on page 51. Speech Rate was the only effect
found significant for search task measures, F (5,20) = 3.88; p < 0.0128. The
significant overall effect of Speech Rate indicated in Table 11 was not reflected for Speech
Rate in subsequent, univariate analyses of variance as shown in Table 12 on page
53, Table 13 on page 54, and Table 14 on page 55.

Scores  for  target  search  time  ratios  ranged  from  0.26272  to  0.87358  with  a 
mean  of  0.659  or  65.9%  of  the  computer-simulated  expert  score  (see  Dependent 
Measures  and  Data  Collection  in  Methods  Section  for  dependent  measure  de¬ 
scription).  Target  search  efficiency  ratios  ranged  from  0.33333  to  0.88889  with  a  mean 
of  0.74.  Invalid  keypress  averages  ranged  from  0.0  to  0.3125  with  an  average  score 
of  0.029.  However,  only  8  subjects  made  invalid  keypresses  with  24  making  none. 
In  3  of  the  8  treatment  combination  cells,  no  errors  were  made  by  any  subject  (see 
Appendix  J  for  reduced  data  listings). 


Table 8. Transcription and Search Task Dependent Measure Means by Voice Type

Search Task Measures

Voice          Search Time    Search Efficiency    Invalid Keypress
Type           Ratio          Ratio                Average

Paul           0.67660750     0.75999687           0.01953125
Betty          0.64040062     0.72306000           0.03906250

Transcription Task Measures

Voice          Strict         Synonym
Type           Errors         Errors


Table 9. Transcription and Search Task Dependent Measure Means by Coding Scheme

Search Task Measures

Coding         Search Time    Search Efficiency    Invalid Keypress
Scheme         Ratio          Ratio                Average

Same           0.63606062     0.71705875           0.03515625
Alternating    0.68094750     0.76599812           0.02343750

Transcription Task Measures

Coding         Strict         Synonym
Scheme         Errors         Errors

Same           8.8125         6.8750
Alternating    8.6250         6.4375


Table 10. Transcription and Search Task Dependent Measure Means by Speech Rate

Search Task Measures

Speech         Search Time    Search Efficiency    Invalid Keypress
Rate           Ratio          Ratio                Average

180 WPM        0.68398250     0.75117437           0.02343750
240 WPM        0.63302562     0.73188250           0.03515625

Transcription Task Measures

Speech         Strict         Synonym
Rate           Errors         Errors

180 WPM        6.6250         4.7500
240 WPM        10.8125        8.5625


Table 11. MANOVA Summary Table for Voice Type x Coding Scheme x Speech Rate Using Search
and Transcription Task Measures

* Approximation of F obtained by conversion using Wilks'
criterion (SAS, 1986).


Table 12. ANOVA Summary Table for Target Search Time Ratios

Source                df    SS            F       p

Between Subjects
  Voices (V)           1    0.01048750    0.70    0.4100
  Coding Scheme (C)    1    0.01611865    1.08    0.3089
  Speech Rate (R)      1    0.02077282    1.39    0.2495
  V x C                1    0.01381330    0.93    0.3454
  V x R                1    0.00501777    0.34    0.5673
  C x R                1    0.01016560    0.68    0.4172
  V x C x R            1    0.00164494    0.11    0.7427
  Subjects/VCR        24    0.35793071

Total                 31    0.43595129


Table 13. ANOVA Summary Table for Target Search Efficiency Ratios

Source                df    SS            F       p

Between Subjects
  Voices (V)           1    0.01091466    0.93    0.3440
  Coding Scheme (C)    1    0.01916050    1.64    0.2132
  Speech Rate (R)      1    0.00297741    0.25    0.6188
  V x C                1    0.00375130    0.32    0.5767
  V x R                1    0.00982416    0.84    0.3689
  C x R                1    0.00955826    0.82    0.3754
  V x C x R            1    0.00220564    0.19    0.6682
  Subjects/VCR        24    0.28115277

Total                 31    0.33954470


Table 14. ANOVA Summary Table for Invalid Keypress Averages

Source                df    SS            F       p

Between Subjects
  Voices (V)           1    0.00305176    0.73    0.4019
  Coding Scheme (C)    1    0.00109863    0.26    0.6133
  Speech Rate (R)      1    0.00109863    0.26    0.6133
  V x C                1    0.00012207    0.03    0.8659
  V x R                1    0.00982416    0.84    0.3689
  C x R                1    0.00598145    1.43    0.2439
  V x C x R            1    0.02062988    4.92    0.0362
  Subjects/VCR        24    0.10058594

Total                 31    0.33954470


Transcription Task Data Analysis


The three-way multivariate analysis of variance (MANOVA) for factors of Voice
Type, Coding Scheme and Speech Rate included analysis of information message
transcription errors obtained under strict and synonym scoring. Means for
transcription task scores are also found in Table 8 on page 49, Table 9 on page 50, and
Table 10 on page 51. The significant overall effect for Speech Rate found in the
MANOVA also requires further analysis of transcription task dependent measures.
Significant effects of speech rate were found in subsequent, univariate analyses of
variance as shown in Table 15 on page 57 and Table 16 on page 58 for both strict and
synonym scoring. Total transcription errors per subject ranged from 2 to 20 under
strict scoring and from 1 to 18 under synonym scoring. Total transcription error
means of 8.719 (strict) and 6.656 (synonym) were significantly different with t (31) =
8.69, p < 0.0001.

Transcription  Error  Analysis  by  Sentence 

Because  of  observations  during  data  collection  and  calculation,  errors  by 
sentence  were  analyzed  in  detail.  Total  errors  made  by  sentence  are  depicted  in 
Figure  7  on  page  60  in  the  order  each  information  message  sentence  was  heard  by 
subjects.  Additionally,  the  number  of  subjects  missing  each  sentence  is  shown  in 
Figure 8 on page 61. Sentences 8 and 11 clearly resulted in more errors than
others, with more subjects making errors for those information message sentences
than for others. However, the strict and synonym error pattern for sentence 11 differs
from that of sentence 8. Detailed review of errors for sentence 11 revealed
18  of  the  42  strict  errors  resulted  from  subjects  substituting  the  word,  “samples”  for 
“samplers”  which  when  scored  under  synonym  rules  is  counted  as  correct.  If  errors 
from  these  sentences  were  deleted  from  the  total  message  transcription  errors  then 
total error means would be 5.031 (strict) and 4.062 (synonym). These means, like
those including errors from sentences 8 and 11, are also significantly different with t
(31) = 5.16, p < .0001.


Table 15. ANOVA Summary Table for Message Transcription Errors (Strict Scoring)

Source                df    SS            F       p

Between Subjects
  Voices (V)           1    5.28125       0.35    0.5589
  Coding Scheme (C)    1    0.28125       0.02    0.8923
  Speech Rate (R)      1    140.28125     9.33    0.0054
  V x C                1    0.28125       0.02    0.8923
  V x R                1    22.78125      1.52    0.2302
  C x R                1    2.53125       0.17    0.6852
  V x C x R            1    0.28125       0.02    0.8923
  Subjects/VCR        24    360.75

Total                 31    532.46875


Table 16. ANOVA Summary Table for Message Transcription Errors (Synonym Scoring)

Source                df    SS            F       p

Between Subjects
  Voices (V)           1    9.03125       0.57    0.4564
  Coding Scheme (C)    1    1.53125       0.10    0.7580
  Speech Rate (R)      1    116.28125     7.38    0.0120
  V x C                1    0.78125       0.05    0.8257
  V x R                1    11.28125      0.72    0.4059
  C x R                1    0.03125       0.00    0.9649
  V x C x R            1    0.03125       0.00    0.9649
  Subjects/VCR        24    378.25

Total                 31    517.21875
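
The F ratios in the univariate ANOVA summary tables follow from the sums of squares: each effect's mean square is divided by the mean square of the Subjects/VCR error term. The short Python sketch below recomputes three F values from the strict-scoring table (Table 15). The function name is hypothetical; the sums of squares and degrees of freedom are those reported in the table.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F = mean square of the effect / mean square of the error term."""
    return (ss_effect / df_effect) / (ss_error / df_error)


# Error term (Subjects/VCR) and selected effects from Table 15.
SS_ERROR, DF_ERROR = 360.75, 24
effects = {
    "Voices (V)": 5.28125,
    "Speech Rate (R)": 140.28125,
    "V x R": 22.78125,
}

for name, ss in effects.items():
    f_value = f_ratio(ss, 1, SS_ERROR, DF_ERROR)
    print(f"{name}: F(1, {DF_ERROR}) = {f_value:.2f}")
```

With one degree of freedom per effect, each mean square equals its sum of squares, so the comparison reduces to dividing by the error mean square (360.75 / 24).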

Errors  for  the  first  eight  sentences  were  also  compared  to  errors  for  the  last 
eight  sentences  to  assess  training  effects.  Results  were  significant  for  both  strict,  t 
(31)  =  4.714;  p  <  .0001,  and  synonym  scoring,  t  (31)  =  7.602;  p  <  .0001.  Means  for 
strict  scoring  data  were  5.531  for  the  first  8  sentences  and  3.188  for  the  last  8.  Means 
for  synonym  scoring  data  were  5.125  for  the  first  8  sentences  and  1.531  for  the  last 
8.  Because  of  these  findings,  difference  scores  between  the  first  and  last  8  sentences 
were  calculated  for  each  subject  (all  scores  were  in  the  same  direction)  and  analyzed 
using  a  three-factor  ANOVA  procedure.  As  shown  in  Table  17  on  page  62  and 
Table  18  on  page  63,  a  significant  effect  for  voice  was  found  for  both  strict  and  syno¬ 
nym  scoring  with  subjects  showing  greater  improvement  with  the  male  voice  (mean 
=  4.562)  than  the  female  (mean  =  2.625).  As  an  additional  comparison,  transcription 
score  means  reflected  as  percent  correct  are  shown  by  Voice  Type  in  Table  19  on 
page  64. 
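
Error counts and percent-correct scores can be related with a simple conversion. The sketch below assumes that each subject's score is based on 64 scored words (16 messages with four scored words each); this total is inferred from the scoring description rather than stated in the text, and the function name is hypothetical.

```python
# Assumed: 16 information messages x 4 scored words per message.
SCORED_WORDS = 16 * 4


def percent_correct(mean_errors, scored_words=SCORED_WORDS):
    """Convert a mean error count into percent of scored words correct."""
    return 100.0 * (1.0 - mean_errors / scored_words)


# Mean strict-scoring errors by speech rate, as reported in Table 10.
strict_errors = {"180 WPM": 6.6250, "240 WPM": 10.8125}
for rate, errors in strict_errors.items():
    print(rate, round(percent_correct(errors), 2))
```

The same conversion can be run in reverse to check a percentage in Table 19 against an error mean, provided the 64-word assumption holds.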


Table 17. ANOVA Summary Table for First 8 - Last 8 Sentence Error Differences (Strict Scoring)

Source                df    SS          F        p

Between Subjects
  Voices (V)           1    42.781      5.944    0.0226
  Coding Scheme (C)    1    1.531       0.213    0.6488
  Speech Rate (R)      1    16.531      2.297    0.1427
  V x C                1    0.781       0.109    0.7447
  V x R                1    5.281       0.734    0.4002
  C x R                1    5.281       0.734    0.4002
  V x C x R            1    0.281       0.039    0.845
  Subjects/VCR        24    172.75

Total                 31    245.217


Table 18. ANOVA Summary Table for First 8 - Last 8 Sentence Error Differences (Synonym
Scoring)

Source                df    SS          F        p

Between Subjects
  Voices (V)           1    30.031      4.258    0.05
  Coding Scheme (C)    1    0.031       0.004    0.9475
  Speech Rate (R)      1    16.531      2.344    0.1388
  V x C                1    0.281       0.04     0.8434
  V x R                1    1.531       0.217    0.6454
  C x R                1    3.781       0.536    0.4711
  V x C x R            1    0.281       0.04     0.8434
  Subjects/VCR        24    169.25

Total                 31    221.717


Table 19. Mean Percent Correct Scored Words by Sentence Groups

                  All          First    Last     First     Last
                  Sentences    Eight    Eight    Eight*    Eight*

Strict Scored
  Paul            85.74        80.27    91.21    88.28     94.92
  Betty           87.01        85.16    88.87    91.80     93.56

Synonym Scored
  Paul            88.77        81.64    95.90    89.65     96.48
  Betty           90.43        86.33    94.53    92.97     95.51

* Without errors caused by sentences 8 and 11.


Subjective Measures


Median scores were computed in the data reduction program for each
subject's transcription certainty, difficulty in understanding the information message, and
difficulty in locating the store item (see Appendix I for rating questions). For the
seven ratings conducted after the main experimental task was finished, individual
ratings were collected. For each of these ten ratings, median or raw scores were
analyzed using the Mann-Whitney U test. Each test evaluated possible differences
between the two levels of each factor of Voice Type, Coding Scheme and Speech
Rate. The only significant test occurred for Speech Rate when subjects rated speech
rate of the system. Results of all tests are summarized in Table 20 on page 66. A
graphical depiction of overall subjective ratings for speech rate as well as subject
response by independent variable levels is shown in Figure 9 on page 67 and
Figure 10 on page 68 respectively. Overall ratings for the remaining nine scales as
well as ratings by each independent variable level are depicted in Appendix J.


Table 20. Mann-Whitney U Values* by Factor for Each Subjective Rating Scale

Rating Scale                  Voice Type    Coding Scheme    Speech Rate

Median Scored
  Transcription Certainty     114           124              88
  Understanding Difficulty    99            115              89
  Locating Difficulty         128           112              112

Raw Scored
  Ease of Use                 125.5         94               109.5
  Voice(s) Intelligibility    118.5         127              84
  Voice(s) Naturalness        89            114              123
  Speech Rate                 106           109.5            28 **
  Response Time               123.5         93               107
  Input Timeout               86.5          102              124
  Menu Organization           97            106              117


Figure 10. Speech Rate Ratings by Voice Type, Coding Scheme and Speech Rate
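
For readers unfamiliar with the statistic reported for the subjective ratings, the following Python sketch computes a Mann-Whitney U value for two independent groups of ratings, using midranks for tied scores. It is a generic textbook formulation, not the SAS procedure used in the study. Reporting the smaller of the two U values is an assumption, and the example ratings are hypothetical.

```python
def mann_whitney_u(group_a, group_b):
    """Return the smaller of U_a and U_b, using midranks for ties."""
    pooled = sorted(list(group_a) + list(group_b))
    # Assign each distinct value the average of the 1-based ranks it occupies.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[x] for x in group_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)


# Hypothetical seven-point ratings for two speech-rate groups.
slow_group = [4, 4, 5, 3, 4]
fast_group = [6, 5, 7, 6, 6]
print(mann_whitney_u(slow_group, fast_group))
```

The smaller the U value relative to half the product of the group sizes, the more the two groups' ratings are separated.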
Performance Results


In  this  study,  alternative  hypotheses  in  the  form  of  questions  with  associated 
implications  of  significance  were  provided  as  a  framework  in  which  to  interpret  re¬ 
sults.  Consequently,  Table  6  on  page  44  and  Table  7  on  page  46,  which  contain 
these  questions  for  both  main  and  post  hoc  analyses,  will  guide  the  discussion. 

Voice 


Total transcription scores did not vary between voices for either strict or
synonym scoring. However, when transcription scores were analyzed as difference
scores between the first eight and last eight sentences, significant effects of Voice
Type were found with those hearing Paul showing more improvement in the last eight
sentences. Each of these findings is discussed in turn.

Researchers have often reported performance measures such as percent
correct for a voice type without reporting statistical significance of their findings. The
study by Green, et al. (1984) is just such an example. An exception is Pratt's (1987)
study comparing four DECtalk voices (including Paul and Betty) to other synthesizers.
Percentages were provided as in other studies but statistical analyses (ANOVA and
Newman-Keuls) were performed on preference measures. Since statistical
significance of Voice Type performance differences is rarely reported, direct comparison
of this study's lack of significant difference is not possible.

However, comparison of percentage scores is possible. Transcription
accuracy means reported in percent correct (see Table 19 on page 64) differ slightly in
relative magnitude from those reported in the literature. Using a sentence
transcription task (Harvard Psychoacoustic Sentences) analogous to one used in this
study, Green, et al. reported percentages of 95.3% for Paul's voice and 90.5% for
Betty's. Results found in this study (strict scoring for comparison with the Green, et al.
(1984) study) show 85.74% for Paul and 87.01% for Betty, which are similar
performance levels when compared to Green, et al. However, this comparison and others
must consider at least three differences between the two studies: first, Green, et al.
used an earlier version of the DECtalk speech synthesizer (DECtalk version 1.8 for the
Green, et al. study compared to the DECtalk version 2.0 used in this one); second, the
task required of subjects differed substantially between the two studies: simple
transcription of synthetically spoken sentences (Green, et al.) as compared to the
integrated task (search and transcription) required by simulation of a telephone
information system; finally, lower percentages reported in this study probably reflect
scoring of the four most difficult words in the sentence as compared to Green, et
al.'s procedure of scoring all words in a sentence.

When errors are analyzed as percentage correct for first eight sentences
heard and last eight sentences heard, interesting performance results between
voices are shown (again, see Table 19 on page 64). Researchers have usually
reported a slight performance advantage for DECtalk's Paul voice when compared to
the Betty voice (although presence or lack of statistically significant differences is
rarely reported, thus limiting the power and extent of possible comparisons).
However, the percentage correct in the first eight sentences for those hearing Betty's
voice is greater than for those hearing Paul's voice. This numerical advantage for
Betty disappears in the last eight sentences heard, with those hearing Paul averaging
91.21% correct and those hearing Betty averaging 88.87%.

A finding not previously reported in the literature occurred when analysis of
transcription scores divided into scores for the first eight and last eight sentences
yielded a significant difference for both strict and synonym scoring. When difference
scores between the first and last eight sentences heard were analyzed, a significant
difference for Voice Type was found. Those subjects hearing Paul trained at a
significantly faster rate although they began at an apparently (no significant difference)
lower level of performance than those hearing Betty. Though this finding demonstrated
the effect of training in synthetic speech reported by several researchers including
Schwab, Nusbaum and Pisoni (1985), Rosson (1985), and Merva and Williges (1986),
none have mentioned differences observed by Voice Type.

Coding  Scheme 

Search task scores neither improved nor deteriorated under the alternating voice coding
scheme, nor did transcription task scores reveal a differential practice effect. It was
hypothesized that those hearing the same voice throughout would have more practice with
that voice and consequently perform better on the transcription task. By the same
reasoning, those experiencing the alternating voice coding scheme would have
approximately 50% less practice with the voice used for the transcription task and
would therefore perform poorly compared to those experiencing the same voice coding
scheme. Because no such difference appeared, the training effect observed for synthetic
speech displays appears to be nonspecific: performance improves even when different
synthetic voices are used in the training session. Alternating voice coding schemes were
also intended as a navigation aid enabling subjects to track menu levels more
accurately. The results support neither hypothesis.
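
As a minimal sketch of the two coding schemes, the function below maps a menu level to a voice. The function name, the level numbering, and the choice of which voice speaks the main menu are assumptions made for illustration; only the voice names Paul and Betty come from the study.

```python
def voice_for_level(level, scheme, base_voice="Paul"):
    """Return the voice used to speak a menu at the given hierarchy level.

    level: 0 for the main menu, 1 for its submenus, and so on.
    scheme: "same" (one voice throughout) or "alternating".
    base_voice: voice used at level 0 (an assumption of this sketch).
    """
    voices = ("Paul", "Betty")
    if base_voice not in voices:
        raise ValueError("unknown voice")
    if scheme == "same":
        return base_voice
    if scheme == "alternating":
        other = voices[1 - voices.index(base_voice)]
        # Even levels use the base voice, odd levels use the other voice.
        return base_voice if level % 2 == 0 else other
    raise ValueError("unknown scheme")


for depth in range(4):
    print(depth, voice_for_level(depth, "alternating"))
```

Under the alternating scheme, a listener hears the transcription-task voice on only about half of the menu levels, which is the reduced practice the hypothesis above refers to.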

In fact, little research exists on aids for auditory database navigation. Calls for
using navigation aids such as the one employed in this study are based more on intuition
than on empirical validation (Kidd, 1982). One subject provided insight into this issue
during the debriefing by maintaining he had heard only one voice even though he was
assigned to an alternating voice condition. Though most subjects assigned to an
alternating voice condition acknowledged hearing two voices, many did not think this
aided database navigation, and some were unsure of the pattern of voice alternation.
Perhaps instructing subjects on the alternating voice coding scheme would have enhanced
its effect. Other reasons for the lack of significant findings for this variable are
considered in the discussion of interaction effects and post hoc analyses.


Speech  Rate 

Speech Rate significantly affected overall performance on both the search and
transcription tasks, which is consistent with findings from previous studies. However,
a more focused effect for Speech Rate was not detected in the three subsequent ANOVA
procedures using search task dependent measures. It is possible for a MANOVA procedure
to reveal a significant effect when separate ANOVAs do not. This phenomenon reflects the
superior power of the MANOVA procedure over separate ANOVAs when a significant effect
is spread across more than one dependent measure (Finkelman, Wolf, and Friend, 1977).
As discussed in the Literature Review Section, earlier research has found effects of
Speech Rate to be at least consistent if not uniformly significant, and the overall
results of this study remain consistent with those findings. Yet, at a speech rate of
240 wpm, intelligibility of synthetic speech does not seem to affect search and
transcription tasks equally. Transcription task measures were significant for both
MANOVA and ANOVA procedures, whereas search task measures were significant only under
the MANOVA procedure.

A possible explanation for the lack of focused Speech Rate effects on search task
measures comes from information theory. As posed by Luce et al. (1983), synthetic
speech is thought to increase the cognitive load on the listener as compared to
comprehension of natural speech. Regardless of the information theory model considered
(serial, parallel, or hybrid), this increased cognitive load diminishes capacity in
working or short-term memory. Increasing speech rate should further increase the high
cognitive load (as compared to natural speech) imposed by synthetic speech, yet no
differential effect of Speech Rate was observed for search task measures. Though
keywords had a slightly shorter pronunciation duration at 240 wpm, the 4-second timeout
probably enabled subject performance comparable to that observed at 180 wpm. With a
4-second timeout (provided for both 180 and 240 wpm conditions), subjects had time to
rehearse and comprehend a keyword before the next keyword was presented. This rehearsal
time was enough to overcome the diminished cues provided by an assumed poor-quality
speech signal presented at the higher rate.
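
A rough calculation suggests why the timeout may dominate the search task. The sketch below assumes a one-word keyword, a speaking time of 60/wpm seconds per word, and a full 4-second timeout after every keyword; actual synthesizer durations were not measured this way, so the figures are illustrative only.

```python
TIMEOUT_S = 4.0  # input timeout after each keyword, as stated in the text


def keyword_slot_seconds(words_in_keyword, wpm, timeout_s=TIMEOUT_S):
    """Approximate time from one keyword's onset to the next keyword's onset."""
    speaking_time = words_in_keyword * 60.0 / wpm
    return speaking_time + timeout_s


slow = keyword_slot_seconds(1, 180)
fast = keyword_slot_seconds(1, 240)
saving_fraction = (slow - fast) / slow
print(f"180 wpm: {slow:.2f} s, 240 wpm: {fast:.2f} s, "
      f"saving: {saving_fraction:.1%}")
```

Under these assumptions a keyword slot lasts about 4.33 s at 180 wpm and 4.25 s at 240 wpm, a difference of roughly 2%, so the time available for rehearsal is nearly identical in the two conditions.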

Under both strict and synonym scoring, Speech Rate significantly affected transcription
accuracy, a finding well established in the literature. However, this finding may not be
a function of rate alone. A majority of subjects during debriefing described an
interfering effect of hearing the phrase “Begin Transcription” after the information
message. Some subjects could be heard repeating the message over and over until typing
it into the computer. One subject in a 240 wpm condition actually began typing before
the computer terminal display had changed, as a strategy to avoid forgetting the message
because of the Begin Transcription phrase. To borrow again from information theory, this
phrase interfered with the rehearsal required to maintain information in short-term
memory. At higher speech rates, subjects have less time for rehearsal, which increases
the capacity demand on short-term memory. The Begin Transcription phrase probably
overloaded short-term memory for some subjects.

As mentioned in the Results Section, two sentences (8 and 11) accounted for considerably
more errors than the other 14, although the error patterns depicted in Figure 7 differed
between the two. Sentence 8 contained words that were obviously unintelligible, but
subjects hearing sentence 11 could comprehend the meaning even if they did not record
the precise words spoken. The most common error for sentence 11 was substitution of the
word “samples” for “samplers”. A limited analysis of transcription errors that
discarded errors caused by sentences 8 and 11 differed little from the earlier analyses
containing those errors. However, the implication for a designer of synthetic speech
displays is clear: messages need careful screening with a large number of potential
users.
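
The distinction between strict and more lenient scoring can be illustrated with a word-by-word sketch. This is one plausible formulation and not necessarily the scoring procedure used in this study; the function name and the table of accepted substitutes are hypothetical, and whether “samples” was actually accepted under synonym scoring is not stated here.

```python
def score_transcription(spoken, typed, synonyms=None):
    """Return the fraction of spoken words reproduced in the same position.

    synonyms maps a spoken word to a set of accepted substitutes; passing
    None gives strict scoring.
    """
    synonyms = synonyms or {}
    spoken_words = spoken.lower().split()
    typed_words = typed.lower().split()
    correct = 0
    for i, word in enumerate(spoken_words):
        heard = typed_words[i] if i < len(typed_words) else None
        if heard == word or heard in synonyms.get(word, set()):
            correct += 1
    return correct / len(spoken_words)


# Sentence 11 from the message set, with its most common substitution.
message = "Manufacturer's samplers are offered to interested shoppers"
response = "Manufacturer's samples are offered to interested shoppers"

strict = score_transcription(message, response)
# Hypothetical substitute table, for illustration only.
lenient = score_transcription(message, response, {"samplers": {"samples"}})
print(f"strict: {strict:.3f}, lenient: {lenient:.3f}")
```

A screening pass over candidate messages could apply such a scorer to pilot transcriptions and flag any message whose strict score falls well below the others.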
Interaction Effects and Post Hoc Analyses


The same question was posed for all cell comparisons: do scores vary among combinations
of independent variable levels? MANOVA results for both search and transcription task
dependent measures answered this question in the negative. Two possible reasons exist
for this result: either failure to reject the null hypotheses indicates these
independent variables hold no import (statistical or pragmatic) for synthetic speech
displays, or these variables could (or do) make a difference but the conduct of the
experiment precluded that discovery. For the second reason, several detailed
explanations exist.

Dependent measures used in this study could have been insensitive to additional
differences caused by manipulation of independent variables. This insensitivity could
result from use of dependent measures inappropriate to the construct being measured.
Effects of independent variables on dependent measures such as search time, search
efficiency, and invalid keypresses have not been widely explored. Though an overall
effect of speech rate was detected for search task dependent measures, no discrete
effects (as reflected by individual ANOVA procedures) were revealed, and it is possible
the search task measures used in this study were not sensitive enough to detect effects
of voice type or coding scheme. The dependent measure of invalid keypresses exemplifies
this viewpoint. Of 32 subjects, 24 never made an invalid keypress, 4 made one, 3 made
two, and 1 made five. Of the 8 treatment conditions, 3 had no subjects making an invalid
keypress and 2 more had one such subject each.
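
The floor effect in the invalid keypress measure can be made concrete from the counts given above. The sketch below only restates those counts; the variable names are illustrative.

```python
from statistics import mean, median

# Number of subjects making 0, 1, 2, and 5 invalid keypresses (from the text).
keypress_counts = {0: 24, 1: 4, 2: 3, 5: 1}

# Expand to one value per subject.
per_subject = [k for k, n in keypress_counts.items() for _ in range(n)]

n_subjects = len(per_subject)
total_invalid = sum(per_subject)
share_at_floor = keypress_counts[0] / n_subjects
mean_invalid = mean(per_subject)
median_invalid = median(per_subject)

print(f"{n_subjects} subjects, {total_invalid} invalid keypresses in total")
print(f"{share_at_floor:.0%} of subjects at zero; "
      f"mean {mean_invalid:.2f}, median {median_invalid}")
```

With three quarters of subjects at zero and only 15 invalid keypresses in the whole experiment, the measure has very little variance left for any independent variable to explain.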




Another reason for possible insensitivity of dependent measures is the strong context
provided by the department store setting. Strong contextual clues could have masked
possible aiding or debilitating effects of the independent variables. Though keyword
intelligibility may have been diminished, the hierarchical relationship of keywords to
each other within the limits of a department store setting may have provided the clues
needed to overcome a supposedly poorer speech signal. Evidence for this view comes from
debriefing comments in which a subject explained his search strategy as a “rule-out”
approach. He understood one keyword, “Household”, but not the other, “Fashion”.
Consequently, he chose the keyword Fashion whenever the target store item appeared not
to fit under the category of Household (“ruling out” the understood keyword). Such a
strategy indicates use of broad, contextual cues.

Finally, the training provided to subjects may have made them less sensitive to the
variables manipulated in the study, and hence the dependent measures less sensitive to
independent variable effects. Subjects were provided with various forms of training,
including two practice runs. This procedure was adopted after preliminary studies, out
of concern that errors generated during the first several searches might reflect task
uncertainty as opposed to effects of independent variables. Providing thorough
instructions and practice was intended to stabilize measures, not mute them. Again,
debriefing comments provide some support, as several subjects said they understood the
task after the tape, though the practice runs following the tape were helpful.

If independent variable effects were indeed obscured by insensitive dependent measures,
several corrections could be made based on the reasoning offered here. First, the number
of subjects could be increased, resulting in a more powerful test by reducing the
effects of subject variability. Second, subject training could be reduced so that
subjects more closely resemble naive users, possibly rendering the dependent measures
more sensitive. However, careful design of experimental procedures would be necessary to
avoid measuring task uncertainty as opposed to the true effects of the paradigm's
independent variables. Finally, decreasing the amount of training also lowers context
familiarity, making intelligibility of keywords (and the effects of independent
variables on them) more critical.


Preference  Results 


Statistical analysis of subjective ratings provided only one significant finding:
subjects assigned to different Speech Rate conditions rated Speech Rate differently, and
their ratings reflected the condition assigned to them. Earlier research consistently
supports this finding, making Speech Rate a pervasive and strong factor in synthetic
speech perception. No further statistically significant differences between subject
groups (classified by independent variable levels) were found. However, in the absence
of performance data or statistically significant data of any kind, preference or
subjective data serve designers as starting points for field trials. Subjective data
gathered in this study could perform the same function for a telephone information
system; the major impressions are summarized below.

The majority of subjective ratings provided by subjects were “favorable” to the system.
Most subjects tended to be very certain about their transcription accuracy, though
ratings were not as high for understanding the information message. High ratings for
ease of locating store items are consistent with the absence of significant differences
in search task measures. Also, most thought the information system easy to use,
possibly a reflection of the experimenter-provided training discussed earlier in this
section. Ratings for intelligibility and naturalness showed some of the more symmetrical
distributions observed, with the overall rating for naturalness resembling a normal
(Gaussian) distribution centered on a rating of four. Of all ratings, intelligibility
and naturalness seemed to be rated lower than other dimensions. Most thought system
response time was very fast, with ample time (input timeout) to respond. The majority
rated menu organization as very simple, a rating which corresponds with subject ratings
of very easy for difficulty of locating store items.




Conclusions 


The study results imply the following guidelines for use of synthetic speech displays
in telephone information systems:

•  Use  of  a  180  wpm  speech  rate  yields  better  transcription  accuracy  (intelligibility) 
as  compared  to  using  a  speech  rate  of  240  wpm. 

• Use of different speech rates significantly affects search tasks in auditory
databases, though precise effects are not yet known. Consequently, though designers
of synthetic speech displays may desire acceleration of search tasks, use of
speech rates faster than 180 wpm needs further research.

•  Users  are  both  aware  of  and  sensitive  to  speech  rate. 

• When applications require strict or precise recall of spoken utterances, the
messages should be screened by a sample of the intended user population to ensure
substitutions are absent or at acceptable levels.

• Although using one voice type (male as opposed to female, as represented by
DECtalk's Perfect Paul and Beautiful Betty) over another provides no statistically
significant advantage, designers should consider the training time available to
users, as those hearing the male synthetic voice improved at a faster rate than
those hearing the female voice.

•  Use  of  alternating  voices  as  a  navigation  aid  in  auditory  databases  provides  no 
apparent  benefit. 

• Avoid placing phrases that are not part of an information node immediately after an
information message node. Violating this principle could interfere with a user's
cognitive rehearsal, a process necessary for short-term memory retention.

Future research on synthetic speech displays in telephone information systems holds
many questions, among which are the following:

•  How  do  training  rates  between  male  and  female  voices  (as  represented  by  Paul 
and  Betty)  compare?  Do  listeners  of  Paul  continue  to  improve  at  a  faster  rate 
while  those  hearing  Betty  asymptote  in  their  performance?  Do  findings  support 
adaptive  rate  features  (user  selected  or  system  provided)? 

•  Does  the  midband  filter  function  inherent  in  telephone  communication  affect 
synthetic  speech  performance  in  ways  different  from  speech  heard  without  using 
a  telephone?  Does  synthetic  speech  performance  in  a  telephone  display  using 
previous  synthetic  speech  measures  (open  and  closed  MRT,  Haskins  and  Harvard 
sentences)  differ  from  previous  results? 

• How may search task dependent measures be rendered more sensitive to effects
of speech rate and other variables? Does a larger number of subjects render the
same dependent measures more sensitive? Would field studies reveal differences
opposite to findings of laboratory studies? Would search task dependent
measures different from those used in this study reflect performance differences
for the independent variables used in this study?

• How does synthetic speech rate specifically affect search tasks in auditory
databases? Can speech rate (or the effect achieved by decreasing keyword
pronunciation duration and timeout rate) be increased for menus as compared to
information message nodes?

•  Do  database  organizations  other  than  the  formal,  hierarchical  structure  featured 
in  this  study  offer  better  performance?  For  example,  does  using  a  database 
containing  more  than  one  path  to  an  information  node  result  in  more  efficient 
searches? 

• What is the minimum time necessary between an information message node and
subsequent system speech in order to prevent interfering with short-term memory
retention of the information message?

• Are users different where synthetic speech is concerned? Do performance and
preference for telephone information systems employing synthetic speech
systematically vary along dimensions of the users? What are those dimensions?

Despite its coarticulation problems and lack of sophisticated prosody, synthetic
speech at current technological levels remains a viable auditory display for
telephone information systems. Much research is needed, though, on auditory database
construction and use of synthetic speech in such databases. The research
recommendations provided above are in no way exhaustive of auditory display problems
pertinent to telephone information studies.
Appendix B. Participant's Informed Consent Form


The  following  experiment  is  a  study  concerning  the  evaluation  of  a 
telephone-based  information  system.  During  the  experiment,  you  will  be  monitored 
with  a  closed-circuit  video  system.  As  a  participant  in  this  experiment,  you  have 
certain  rights  as  explained  below.  The  purpose  of  this  document  is  to  describe  these 
rights  and  to  obtain  your  written  consent  to  participate  in  the  experiment. 

1.  You  have  the  right  to  discontinue  your  participation  in  the  study  at  any  time  for 
any  reason.  If  you  decide  to  terminate  the  experiment,  inform  the  researcher  and 
he  will  pay  you  for  the  length  of  time  you  have  participated. 

2. You have the right to inspect your data and withdraw it from the experiment if you
feel that you should for any reason. In general, data are processed and analyzed
after a subject has completed the experiment. At that time, all identification
information will be removed and the data treated with anonymity. Therefore, if you
wish to withdraw your data, you must do so immediately after your participation
is completed.

3. You have the right to be informed of the overall results of the experiment. If you
wish to receive a synopsis of the results, include your address with your signature
below. If, after receiving the synopsis, you would like more in-depth information,
please contact Virginia Tech's Human Computer Interaction Laboratory
and a full report will be made available to you.

This  research  is  funded  by  a  research  contract  with  the  National  Science 
Foundation.  The  co-principal  investigators  are  Dr.  Robert  Williges,  and  Ms.  Beverly 
Williges.  The  researcher  is  David  W.  Herlong.  He  can  be  contacted  at  the  following 
address  and  phone  number: 

Human  Computer  Interaction  Laboratory 
530  Whittemore  Hall 

Virginia  Polytechnic  Institute  and  State  University 
Blacksburg,  Virginia  24061 
(703)  961-4602 
Further comments or questions can be addressed to Charles Waring, chairman
of the Institutional Review Board for the Use of Human Subjects in Research.
He can be contacted at the address and the phone number listed below:

Charles  Waring 

Office  of  Sponsored  Research  Programs 
301  Burruss  Hall 

Virginia  Polytechnic  Institute  and  State  University 
Blacksburg,  Virginia  24061 
(703)  961-5283 

If you have any questions about the experiment or your rights as a participant,
please do not hesitate to ask. The researcher will do his best to answer them, subject
only to the constraint that he does not pre-bias the experimental results.

Your  signature  below  indicates  that  you  have  read  and  understand  your  rights 
as  a  participant  (as  stated  above),  and  that  you  consent  to  participate. 


Participant's  Signature 


Witness'  Signature 


Print  your  name  and  address  if  you 
wish  to  receive  a  summary  of 
the  experimental  results. 
Appendix C. Subject Information Questionnaire


Age: _  Sex: _  Native  language: _ 

Please  list  any  hearing  impairments  you  may  have: 

For  the  following  questions,  please  circle  the  most  accurate  response: 

1.  How  experienced  are  you  with  using  computers? 

---|------------------|------------------|------------------|---

No  experience  Some  experience  Experienced  Very  Experienced 

2.  How  experienced  are  you  with  using  information  systems? 

---|------------------|------------------|------------------|---

No  experience  Some  experience  Experienced  Very  Experienced 

3.  How  experienced  are  you  with  listening  to  synthesized  speech? 

---|------------------|------------------|------------------|---

No  experience  Some  experience  Experienced  Very  Experienced 
Appendix D. Introduction


Hello,  and  welcome  to  the  Human-Computer  Interaction  Lab.  Today,  you  have 
the  opportunity  to  participate  in  our  research  on  how  people  interact  with  talking 
computers. 

In this experiment, you will try to find information on certain items in a
department store (Hokie Wholesale). The department store has a talking computer
database system which provides shoppers with helpful information on store items.
Shoppers call the database system on a telephone to find information on selected
merchandise. Similarly, you will be using the telephone to find specific information
in the database. The talking computer may sound a bit strange at first, but we are
sure you will soon be able to understand everything it says. The computer does not
understand human speech, but does interpret certain key presses on the telephone
keypad as commands.

The database system works by speaking menus of keywords. Keywords are
titles for a group of related items (e.g., automotive is a keyword for a group of items
like tires, car batteries, and motor oil). When you hear a keyword which most closely
relates to the item you are searching for, select that keyword by pressing a defined
key on the telephone keypad. The system will then speak a new menu of keywords
related to the selected keyword. By selecting the appropriate keywords, you locate
the store item in the database. Once you have selected the store item, the computer
will speak a short information message about the store item. This message will have
something to do with the price, location, availability, or important information about
the store item.
Appendix E. Instructions


Your task is to search for information on store items in the department store's
talking database. Store items will be presented as targets on the computer display
in front of you. You will find the target by using the telephone keys to move through
the talking database.

These  are  your  instructions: 

1.  Press  the  ON/OFF  key  on  the  telephone  keypad  and  listen  for  a  dial  tone. 

2.  Press  the  DIAL  key  on  the  telephone  keypad  (upper  right  corner). 

3.  The  talking  computer  will  answer  the  telephone  and  offer  you  instructions.  Press 
the  “#”  key  and  listen  carefully  to  the  instruction  for  using  the  telephone  keypad. 

4.  Read  the  first  target  on  the  computer  display  in  front  of  you. 

5.  Watch  the  computer  display.  It  will  signal  you  when  the  search  is  about  to  begin. 

6. The talking computer will begin speaking a menu of keywords. Keywords
categorize groups of store items. After each keyword is spoken, the computer will
pause briefly to allow you to select the item. If you do not select the item, the
computer will speak another keyword for that menu.

7.  To  locate  the  target,  select  a  keyword  from  the  menu  which  best  categorizes  the 
store  item  you  are  searching  for.  The  computer  will  then  speak  a  new  menu  of 
keywords,  based  on  your  selection.  If  you  need  to  hear  the  keypad  instructions 
again,  select  HELP  from  any  menu. 

8. Continue listening to menus and selecting keywords until you reach the desired
store item.

9.  When  you  hear  the  desired  store  item,  press  the  2  key  on  the  telephone  keypad 
and  listen  carefully  to  the  information  message. 

10.  The  computer  display  will  prompt  you  to  transcribe  what  you  heard. 

11. Type the information message you heard into the computer, and press the RETURN
key.

12. Rate the certainty of your transcription being correct on a scale of 1 (very
uncertain) to 7 (very certain), and press the RETURN key.

13. Rate the difficulty of understanding the message on a scale of 1 (very difficult) to
7 (very easy), and press the RETURN key.

14.  Rate  the  difficulty  of  locating  the  store  item  on  a  scale  of  1  (very  difficult)  to  7 
(very  easy),  and  press  the  RETURN  key. 

15.  Read  the  next  target  on  the  computer  display  and  get  ready  to  start  the  next 
search.  The  computer  display  will  signal  you  to  begin  the  next  search  and  will 
speak  the  first  item  in  the  main  menu.  Locate  the  next  target  and  transcribe  the 
information  message. 

16. The experiment will proceed in this fashion. You will search for a total of 16
targets.

17. The computer will indicate when you have completed the experiment. The computer
display will then request that you rate certain characteristics of the telephone
information system. The meaning of each characteristic and how it should
be rated will be explained on the computer display.

If you have any questions, please ask the experimenter now.
Appendix F. Subject's Instructions


The video instructions you just watched included a demonstration of how the
telephone information system works and how you should perform the task for this
study. The actual telephone information system you will be using today will be
similar to the system in the video, but may be different in some ways.


These are the commands that are available to you on the telephone keypad:

To select an item, press the # key.

To back up one menu, press the * key.

To select the main menu, press the 0 key.

When you locate the store item, press the 2 key to hear the information message.
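
The keypad commands listed above can be sketched as a small state machine over a hierarchical menu. This is an illustrative reconstruction, not the software used in the study: the class name, the intermediate category names, and the placement of items under “Household” and “Fashion” are assumptions, while the key assignments and the two information messages come from the appendices.

```python
# Hypothetical fragment of the store hierarchy. Nested dicts are menus;
# strings are information messages attached to store items.
MENU = {
    "Household": {
        "Appliances": {
            "Food blenders": "Boxes and cartons are in the wrapping center.",
        },
    },
    "Fashion": {
        "Men's wear": {
            "Men's blazers": "Garment bags are offered with new purchases.",
        },
    },
}


class Navigator:
    """Tracks the caller's position in the menu hierarchy."""

    def __init__(self, menu):
        self.menu = menu
        self.path = []  # keywords selected so far

    def _node(self):
        node = self.menu
        for keyword in self.path:
            node = node[keyword]
        return node

    def options(self):
        """Keywords (or store items) spoken at the current level."""
        return list(self._node())

    def press(self, key, current=None):
        """Apply one keypress; `current` is the keyword just spoken.

        Returns the information message when the 2 key is pressed on a
        store item, otherwise None. Any other input leaves the position
        unchanged, as an invalid keypress would.
        """
        node = self._node()
        if key == "#" and current in node and isinstance(node[current], dict):
            self.path.append(current)      # descend into the selected menu
        elif key == "*" and self.path:
            self.path.pop()                # back up one menu
        elif key == "0":
            self.path.clear()              # return to the main menu
        elif key == "2" and isinstance(node.get(current), str):
            return node[current]           # speak the information message
        return None


nav = Navigator(MENU)
nav.press("#", "Household")
nav.press("#", "Appliances")
print(nav.press("2", "Food blenders"))
```

Representing the position as a path of selected keywords makes the back-up (*) and main-menu (0) commands a simple pop and clear, and the length of the path gives the current menu level.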
Appendix G. Database Information Targets and Messages


Message  type  indicated  in  parentheses:  (I)  =  Information,  (A)  =  Availability,  (P)  = 
Price,  and  (L)  =  Location. 

1.  Target:  What  is  the  information  message  for  laundry  washers? 

Information  message  heard:  Deluxe  models  are  available  with  green  trimming. 
(A) 

2.  Target:  What  is  the  information  message  for  football  books? 

Information  message  heard:  Faculty  discounts  are  offered  to  gym  teachers.  (I) 

3.  Target:  What  is  the  information  message  for  eye  mascara? 

Information  message  heard:  Travel  supplies  are  sold  for  $17.50.  (P) 

4.  Target:  What  is  the  information  message  for  men's  blazers? 

Information  message  heard:  Garment  bags  are  offered  with  new  purchases.  (I) 

5.  Target:  What  is  the  information  message  for  food  blenders? 

Information  message  heard:  Boxes  and  cartons  are  in  the  wrapping  center.  (L) 

6.  Target:  What  is  the  information  message  for  guitars? 

Information  message  heard:  Carrying  cases  are  reduced  by  55  to  63%.  (P) 

7.  Target:  What  is  the  information  message  for  pearl  necklaces? 

Information  message  heard:  Sorority  clasps  are  in  the  school  department.  (L) 

8.  Target:  What  is  the  information  message  for  hope  chests? 

Information  message  heard:  Walnut  stains  are  reduced  by  34  to  40%.  (P) 

9.  Target:  What  is  the  information  message  for  silk  blouses? 
Information message heard: Maternity wear is near ladies lingerie. (L)

10.  Target:  What  is  the  information  message  for  compact  disc  recordings? 
Information  message  heard:  Head  cleaners  are  on  aisle  12.  (L) 

11.  Target:  What  is  the  information  message  for  women's  oriental  fragrances? 

Information  message  heard:  Manufacturer's  samplers  are  offered  to  interested 
shoppers.  (I) 

12.  Target:  What  is  the  information  message  for  men's  sweaters? 

Information  message  heard:  Rugby  letters  are  sold  for  $11.60.  (P) 

13.  Target:  What  is  the  information  message  for  knit  dresses? 

Information  message  heard:  Designer  collections  are  available  in  red  and  ivory. 
(A) 

14.  Target:  What  is  the  information  message  for  gold  chains? 

Information  message  heard:  Instant  financing  is  available  at  the  central  office.  (A) 

15.  Target:  What  is  the  information  message  for  recliner  chairs? 

Information  message  heard:  Leather  coverings  are  offered  to  wholesale  buyers. 
(I) 

16.  Target:  What  is  the  information  message  for  chicken  cookbooks? 

Information  message  heard:  Collector  editions  are  available  in  limited  quantities. 
(A) 

Appendix  H.  Rating  Scales 


Individual  Target  Search  Ratings 


1.  Rate  how  certain  you  are  of  your  transcription  on  the  following  scale: 


1  2  3  4  5  6  7 

Very  Uncertain  Very  Certain 


2.  Rate  how  difficult  it  was  to  understand  the  information  message  on  the  following 
scale: 

1  2  3  4  5  6  7 

Very  Difficult  Very  Easy 


3.  Rate  how  difficult  it  was  to  locate  the  store  item  on  the  following  scale: 


1  2  3  4  5  6  7 

Very  Difficult  Very  Easy 


Post-Experimental  Search  Ratings 


1.  Rate  the  ease  of  use  of  the  system  on  the  following  scale: 

1  2  3  4  5  6  7 

Very  Difficult 

2.  Rate  the  intelligibility  of  the  computer  voice  on  the  following  scale: 


1  2  3  4  5  6  7 

Very  Unintelligible  Very  Intelligible 

3.  Rate  the  naturalness  of  the  computer  voice  on  the  following  scale: 


1  2  3  4  5  6  7 

Very  Unnatural  Very  Natural 

4.  Rate  how  fast  the  computer  talked  on  the  following  scale: 

1  2  3  4  5  6  7 

Very  Slow  Very  Fast 

5.  Rate  the  speed  at  which  the  system  responded  to  your  input  on  the  following 
scale: 


1  2  3  4  5  6  7 

Very  Slow  Very  Fast 

6.  Rate  the  amount  of  time  you  had  to  respond  on  the  following  scale: 


1  2  3  4  5  6  7 

Very  Little  Very  Much 

7.  Rate  the  menu  organization  on  the  following  scale: 


1  2  3  4  5  6  7 

Very  Difficult  Very  Simple 

Appendix  I.  Subject  Debrief 


1.  Do  you  like  the  idea  of  an  information  system  like  this  one? 

2.  Would  you  use  an  information  system  like  this  one? 

3.  What  applications  seem  appropriate  for  an  information  system  such  as  this  one? 

4.  What  improvements  would  you  suggest? 

5.  Overall,  did  you  like  (or  enjoy)  using  this  system? 

6.  What  information  would  you  like  to  add  to  the  instructions? 

7.  What  would  you  not  include  in  the  instructions? 

8.  Did  you  understand  the  commands? 

If  not: 

a.  Which  commands  confused  you? 

b.  What  did  you  understand  the  command  to  do? 

c.  How  did  the  execution  of  the  command  differ  from  your  expectations? 

9.  Are  there  any  commands  you  would  like  to  add? 

10.  Are  there  any  commands  you  would  like  to  eliminate? 

11.  What  command  would  you  use  to  restart  if  you  got  lost? 

12.  What  command  would  you  use  if  you  wanted  to  back  up  one  category? 

13.  Do  you  think  you  understand  the  organization  of  the  database  well  enough  to  use 
the  system  comfortably? 

14.  Did  the  keyword  categories  confuse  you? 

15.  What  would  you  change  about  the  experimental  session? 

16.  Was  the  session  length  too  long? 

17.  Was  the  task  interesting  or  boring? 

For  subjects  who  heard  alternating  voices: 

18.  Did  you  hear  more  than  one  type  of  voice? 

19.  Was  one  more  intelligible  than  the  other  (which  one)? 

20.  Was  one  more  natural  or  human  sounding  than  the  other? 

21.  Do  you  prefer  one  of  these  voices  over  the  other? 

Appendix  J.  Performance  and  Preference  Data  Summary 

Figure  11.  Overall  Transcription  Certainty  Ratings 
(Vertical  axis:  Percent  of  Subjects  Responding.  Horizontal  axis:  rating  from  1,  Very  Uncertain,  to  7,  Very  Certain.  Series:  All  Conditions.) 

Figure  12.  Transcription  Certainty  Ratings  by  Voice  Type,  Coding  Scheme  and  Speech  Rate 
(Panels:  Voice  Type  (Paul,  Betty);  Coding  Scheme  (Same  Voice,  Alternating);  Speech  Rate  (180  WPM,  240  WPM).  Horizontal  axis:  rating  from  1,  Very  Uncertain,  to  7,  Very  Certain.) 

Figure  13.  Overall  Understanding  Difficulty  Ratings 

Figure  14.  Understanding  Difficulty  Ratings  by  Voice  Type,  Coding  Scheme  and  Speech  Rate 
(Vertical  axis:  Percent  of  Subjects  Responding.  Horizontal  axis:  rating  from  1,  Very  Difficult,  to  7,  Very  Easy.) 

Figure  15.  Overall  Difficulty  in  Locating  Store  Item  Ratings 
(Vertical  axis:  Percent  of  Subjects  Responding.) 

Figure  16.  Difficulty  in  Locating  Store  Item  Ratings  by  Voice  Type,  Coding  Scheme  and  Speech  Rate 
(Panels:  Voice  Type;  Coding  Scheme  (Same  Voice,  Alternating);  Speech  Rate  (180  WPM,  240  WPM).  Horizontal  axis:  rating  from  1,  Very  Difficult,  to  7,  Very  Easy.) 

Figure  18.  Ease  of  Use  Ratings  by  Voice  Type,  Coding  Scheme  and  Speech  Rate 

Figure  20.  Intelligibility  Ratings  by  Voice  Type,  Coding  Scheme  and  Speech  Rate 

Figure  22.  Naturalness  Ratings  by  Voice  Type,  Coding  Scheme  and  Speech  Rate 


Figure  24.  Response  Time  Ratings  by  Voice  Type,  Coding  Scheme  and  Speech  Rate 
(Panels:  Voice  Type;  Coding  Scheme  (Same  Voice,  Alternating);  Speech  Rate  (180  WPM,  240  WPM).  Horizontal  axis:  rating  from  1,  Very  Slow,  to  7,  Very  Fast.) 

Figure  25.  Overall  Input  Timeout  Ratings 

Figure  27.  Overall  Menu  Organization  Ratings 

Figure  28.  Menu  Organization  Ratings  by  Voice  Type,  Coding  Scheme  and  Speech  Rate 